Objective: construct a set of leading indicators from Twitter data that should be highly correlated with UK inflation or UK inflation expectations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import pickle
Sample of 844,249 tweets on inflation in the UK for the period 2018-2022. The file has 27 columns.
data = pd.read_parquet('part-00000-64a259cf-ea32-483e-b267-f4a0854c7dc3-c000.snappy.parquet')
data.shape
(844249, 27)
data.columns
Index(['id', 'verb', 'user', 'link', 'body', 'retweetbody', 'date',
'postedtime', 'retweetcount', 'favoritescount', 'generator',
'twitter_lang', 'year', 'userLocation', 'userFriends', 'userFollowers',
'userNumTweets', 'userVerified', 'userLanguage', 'userBio', 'country',
'countrycode', 'locality', 'region', 'subregion', 'locationname',
'inreplyto.link'],
dtype='object')
data['date'].sort_values()
33786 2018-01-01
113785 2018-01-01
117986 2018-01-01
785283 2018-01-01
215226 2018-01-01
...
33781 2022-12-31
33782 2022-12-31
33783 2022-12-31
147534 2022-12-31
299479 2022-12-31
Name: date, Length: 844249, dtype: object
data.iloc[0]
id                                                     -1412257600
verb                                                          post
user                                                    -931661990
link             http://twitter.com/ahlaamomarr/statuses/100070...
body             After successfully making iftar for your whole...
retweetbody                                                   None
date                                                    2018-05-27
postedtime                                2018-05-27T11:42:52.000Z
retweetcount                                                     0
favoritescount                                                   0
generator                                       Twitter for iPhone
twitter_lang                                                    en
year                                                          2018
userLocation                                       London, England
userFriends                                                    145
userFollowers                                                  232
userNumTweets                                                 9408
userVerified                                                 False
userLanguage                                                  None
userBio                                                       sabr
country                                             United Kingdom
countrycode                                                     GB
locality                                                    London
region                                                     England
subregion                                           Greater London
locationname                       London, England, United Kingdom
inreplyto.link                                                None
Name: 0, dtype: object
data.iloc[1]
id                                                       531418800
verb                                                         share
user                                                    1115606588
link             http://twitter.com/platypusbanker/statuses/100...
body             RT @BrexitBin: If you still think this is an a...
retweetbody      If you still think this is an acceptable price...
date                                                    2018-05-28
postedtime                                2018-05-28T14:30:42.000Z
retweetcount                                                    87
favoritescount                                                   0
generator                                       Twitter for iPhone
twitter_lang                                                    en
year                                                          2018
userLocation                                   South East, England
userFriends                                                    402
userFollowers                                                  123
userNumTweets                                                 5952
userVerified                                                 False
userLanguage                                                  None
userBio                                                       None
country                                             United Kingdom
countrycode                                                     GB
locality                                            South Ockendon
region                                                     England
subregion                                                    Essex
locationname               South Ockendon, England, United Kingdom
inreplyto.link                                                None
Name: 1, dtype: object
data['verb'].value_counts()
share    493552
post     350697
Name: verb, dtype: int64
data['link'].value_counts()
http://twitter.com/ahlaamomarr/statuses/1000703881368821760 1
http://twitter.com/lcaller/statuses/1471510670323511299 1
http://twitter.com/Knuckle97716851/statuses/1470762090432503817 1
http://twitter.com/pwayman/statuses/1470795254328664070 1
http://twitter.com/7_StarGirlx/statuses/1470978679782129664 1
..
http://twitter.com/ImaniDH_/statuses/1544630672265994240 1
http://twitter.com/louorns/statuses/1544642295009517569 1
http://twitter.com/DualAspectGlass/statuses/1544682838846472194 1
http://twitter.com/boutiqueheathe1/statuses/1544687877820391424 1
http://twitter.com/Forster06/statuses/999751863745511424 1
Name: link, Length: 844249, dtype: int64
data['generator'].value_counts()
Twitter for iPhone 280860
Twitter for Android 234807
Twitter Web App 165878
Twitter for iPad 47807
Twitter Web Client 26016
...
BleuPage 1
troocostJK 1
ozziapp 1
ChelseaPro 1
TweetPoll 1
Name: generator, Length: 1883, dtype: int64
data['favoritescount'].value_counts()
0    844249
Name: favoritescount, dtype: int64
data['twitter_lang'].value_counts()
en    844249
Name: twitter_lang, dtype: int64
data['userLocation'].value_counts()
London, England 78818
United Kingdom 62627
London 62299
England, United Kingdom 42198
UK 39182
...
Tredegar, Wales, UK 1
woodstock. 1
the end 1
Ponteland, Northumberland 1
Sherwood Nottingham 1
Name: userLocation, Length: 28335, dtype: int64
data['userLanguage'].value_counts()
Series([], Name: userLanguage, dtype: int64)
data['userBio'].value_counts()
We are the UK's #1 commission and surcharge free heating oil quote website connecting heating oil consumers with 210+ heating oil suppliers. 3123
TVCables is part of Nimbus Designs Ltd, established in 1980. We are the best known internet cable retailer in the UK. 2523
Exchange turnip price and make friends with real-time chat! (Turn on notification time to get latest best prices😉) #AnimalCrossing #TurnipsExchange 2022
Reporting changes to #FTSE & #AIM shares as well as director trades (#Directortrade). Now showing #Crypto, #Commodities and #Currency exchange rates. 1602
Photographer - Occasional blogger.\nWhy not LIKE my FB page here: http://t.co/Sj7or8AK My Etsy Shop - http://t.co/QVi7Xyoxgb 791
...
#ShopLocal / #BuyBritish / Father, son, & unholy spirit / Politics geek / National conservatism / PPE student / Host of #BurkeanPaine / Former News Editor 1
Accepting academic commissions\n(maths and sciences.chem, physics, labs, projects, calculus, geometry, stats etc)\npaypal me onlinebestessays2@gmail.com \n20%free! 1
Celebrating traditional beer and pubs, pub cats and classic rock jukeboxes. Lockdown sceptic. Opponent of the Nanny State in all its forms. 1
people say that money isn’t everything— but I’d like to see you live without it.. 🎵 1
Supporter of Newcastle United, Morpeth Town AFC, Newcastle Falcons, love Boxing and good music ⚽️🍻🏉🎸👊 1
Name: userBio, Length: 347133, dtype: int64
data['country'].value_counts()
United Kingdom    844249
Name: country, dtype: int64
data['countrycode'].value_counts()
GB    844249
Name: countrycode, dtype: int64
data['locality'].value_counts()
London 188012
Manchester 20978
Glasgow 14445
South Ockendon 13839
Yorkshire 12848
...
Meppershall 1
Tintagel 1
Downpatrick 1
Knighton 1
Bampton 1
Name: locality, Length: 2109, dtype: int64
data['region'].value_counts()
England      591164
Scotland      77150
Wales         25761
N Ireland     10415
Name: region, dtype: int64
data['subregion'].value_counts()
Greater London 206743
City and Borough of Manchester 21129
Essex 18289
Glasgow City 14445
City and Borough of Liverpool 12209
...
North Down District 3
Magherafelt District 3
Ballymoney District 1
Cookstown District 1
Dungannon District 1
Name: subregion, Length: 199, dtype: int64
data['locationname'].value_counts()
London, England, United Kingdom 188012
United Kingdom 139759
England, United Kingdom 72032
Scotland, United Kingdom 29560
Manchester, England, United Kingdom 20978
...
Burtonwood, England, United Kingdom 1
Chobham, England, United Kingdom 1
Sutton Bonington, England, United Kingdom 1
Sturminster Newton, England, United Kingdom 1
Bampton, England, United Kingdom 1
Name: locationname, Length: 2122, dtype: int64
data['inreplyto.link'].value_counts()
http://twitter.com/IsabelOakeshott/statuses/1527199067666784256 28
http://twitter.com/RishiSunak/statuses/1506692275216265219 25
http://twitter.com/DPJHodges/statuses/1592821196495917056 24
http://twitter.com/Conservatives/statuses/1597961261199011840 23
http://twitter.com/DHSCgovuk/statuses/1590723453899898883 16
..
http://twitter.com/GraemeBrogan/statuses/1102714253713309697 1
http://twitter.com/mully1410/statuses/1100379533394526208 1
http://twitter.com/evilblonderobot/statuses/1101961846200983552 1
http://twitter.com/CaliforniaJoe01/statuses/1101397671514959872 1
http://twitter.com/LeopoldHeinrich/statuses/997465650892365824 1
Name: inreplyto.link, Length: 143831, dtype: int64
data.isna().sum()
id                     0
verb                   0
user                   0
link                   0
body                   0
retweetbody       350697
date                   0
postedtime             0
retweetcount           0
favoritescount         0
generator              0
twitter_lang           0
year                   0
userLocation           0
userFriends            0
userFollowers          0
userNumTweets          0
userVerified           0
userLanguage      844249
userBio            78205
country                0
countrycode            0
locality          259078
region            139759
subregion         273556
locationname           0
inreplyto.link    695334
dtype: int64
data_cleaned = data.copy(deep=True)
data_cleaned = data_cleaned.drop(columns=['id', 'user', 'favoritescount',
'link', 'twitter_lang',
'userLanguage', 'country',
'countrycode', 'locationname',
'inreplyto.link'])
# Encode verb as binary: 0 = original post, 1 = share/retweet
data_cleaned['verb'] = data_cleaned['verb'].apply(lambda x: 0 if x=='post' else 1)
# Cast the boolean userVerified flag to 0/1
data_cleaned['userVerified'] = data_cleaned['userVerified'].astype(int)
data_cleaned['postedtime'] = pd.to_datetime(data_cleaned['postedtime'])
data_cleaned = data_cleaned.sort_values('postedtime')
data_cleaned = data_cleaned.set_index('postedtime')
data_cleaned['Month'] = pd.to_datetime(data_cleaned.date.apply(lambda x: x[:-3]))
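The Month column above relies on 'date' being a fixed-width 'YYYY-MM-DD' string, so dropping the last three characters leaves a 'YYYY-MM' key. A quick self-contained check of that slicing assumption (the sample date is taken from the first row shown earlier):

```python
import pandas as pd

# 'date' values look like '2018-05-27'; slicing off '-DD' leaves the
# 'YYYY-MM' month key, which pd.to_datetime parses as the first of the month.
sample_date = "2018-05-27"  # from data.iloc[0] above
month_key = sample_date[:-3]
print(month_key)                     # 2018-05
print(pd.to_datetime(month_key))    # 2018-05-01 00:00:00
```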
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
import string
# Define stopwords
stop_words = set(stopwords.words('english'))
# Preprocess function
def preprocess_tweet(tweet):
# Convert to lowercase
tweet = tweet.lower()
# Remove urls
tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet, flags=re.MULTILINE)
# Remove user @ references and '#' from tweet
tweet = re.sub(r'\@\w+|\#','', tweet)
# Remove punctuations
tweet = tweet.translate(str.maketrans('', '', string.punctuation))
# Tokenize the tweet
tokens = word_tokenize(tweet)
# Remove stopwords and stem the words
ps = PorterStemmer()
tokens = [ps.stem(token) for token in tokens if token not in stop_words]
processed_text = " ".join(tokens)
return processed_text, tokens
# Apply the preprocessing function to the tweets
data_cleaned[['processed_tweet', 'tweet_tokens']] = data_cleaned['body'].apply(preprocess_tweet).apply(pd.Series)
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AhmedOmar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\AhmedOmar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
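To see what the cleaning stages of preprocess_tweet do in isolation, here is a minimal standard-library sketch of the same regex and punctuation steps (no NLTK tokenisation or stemming); the sample tweet is made up for illustration:

```python
import re
import string

def clean_tweet_text(tweet):
    # Same steps as preprocess_tweet, minus tokenisation and stemming
    tweet = tweet.lower()
    tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet, flags=re.MULTILINE)
    tweet = re.sub(r'\@\w+|\#', '', tweet)   # drop @mentions and '#' symbols
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(tweet.split())            # collapse repeated whitespace

example = "UK #inflation hits 9%! See https://t.co/abc via @ONS"
print(clean_tweet_text(example))  # uk inflation hits 9 see via
```

Note that after the full pipeline the stored text is also stemmed, so for instance 'inflation' becomes 'inflat' in processed_tweet.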
import pandas as pd
from textblob import TextBlob
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Function to perform sentiment analysis with a focus on inflation-related tweets
def get_sentiment_with_inflation_context(text):
analysis = TextBlob(text)
# Check for the presence of inflation-related keywords in the tweet
# Updated list of inflation-related keywords and phrases with Twitter-specific terms
inflation_keywords = [
'inflation', 'price increase', 'rising prices', 'monetary policy',
'central bank', 'CPI', 'consumer price index', 'PPI',
'producer price index', 'economic growth', 'cost of living',
'interest rates', 'inflation rate', 'hyperinflation', 'deflation',
'stagflation', 'inflationary pressures', 'purchasing power',
'monetary tightening', 'monetary easing', 'quantitative easing',
'currency devaluation', 'inflation target',
'inflationary expectations', 'core inflation', 'food inflation',
'fuel inflation', 'rental inflation', 'wage inflation',
'imported inflation', 'cost-push inflation',
'demand-pull inflation', 'structural inflation', 'inflation hedge',
'inflation risk', 'inflationary environment',
'inflationary spiral', 'inflation-adjusted', 'inflationary trends',
'economy', 'growth', 'recession', 'financial crisis', 'market',
'fed', 'federal reserve', 'economist', 'fiscal policy',
'monetary stimulus', 'market volatility', 'stock market',
'unemployment', 'interest rate hike', 'interest rate cut',
"purchasing managers' index", 'PMI', 'economic indicators',
'economic outlook', 'inflation expectations', 'economic data',
'currency', 'economic recovery', 'economic uncertainty',
'economic analysis', 'economic performance', 'business cycle',
'economics', 'monetary policy meeting', 'financial markets',
'global economy', 'economic trends', 'economic forecast',
'economic news', 'monetary policy decisions', 'economic stimulus',
'inflation fears', 'economic report', 'central bank action',
'economic growth rate', 'economic impact', 'economic development',
'economic conditions', 'econometrics', 'economic models',
'economics research', 'economic data analysis', 'economic policy',
'economy news', 'economy analysis', 'economy performance',
'economy forecast', 'economy data', 'fed meeting',
'interest rate decisions', 'interest rate changes',
'monetary policy tools', 'monetary policy actions',
'interest rate movements', 'federal reserve actions',
'economist views'
]
    # Case-insensitive keyword match. Note: the text passed in has already
    # been lowercased and stemmed by preprocess_tweet, so upper-case keywords
    # like 'CPI' and stem-altered words (e.g. 'inflationary' -> 'inflationari')
    # would never match without lowercasing the keywords here.
    inflation_related = any(keyword.lower() in text for keyword in inflation_keywords)
    # Sentiment polarity from TextBlob, computed for every tweet
    polarity = analysis.sentiment.polarity
    # Assign a sentiment label based on the polarity score
if polarity > 0:
sentiment = "Positive"
elif polarity < 0:
sentiment = "Negative"
else:
sentiment = "Neutral"
return inflation_related, sentiment, polarity
# Perform sentiment analysis for each tweet and store the sentiment labels in a new column
data_cleaned[['inflation_related', 'sentiment', 'sentiment_score']] = data_cleaned['processed_tweet'].apply(get_sentiment_with_inflation_context).apply(pd.Series)
data_cleaned_inflation = data_cleaned[data_cleaned.inflation_related].copy(deep=True)
data_cleaned_inflation.to_csv('tweets_cleaned_inflation_related.csv')
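The labelling rule inside get_sentiment_with_inflation_context reduces to a simple sign check on the TextBlob polarity score; a standalone version (no TextBlob required) makes the binning explicit:

```python
def label_polarity(polarity):
    # Positive score -> "Positive", negative -> "Negative", zero -> "Neutral"
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

print([label_polarity(p) for p in (0.4, -0.1, 0.0)])
# ['Positive', 'Negative', 'Neutral']
```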
# List of keywords related to professionals who can talk about inflation
inflation_professionals_keywords = [
'economist', 'financial analyst', 'central bank economist',
'macroeconomist', 'monetary policy expert', 'economic researcher',
'economic consultant', 'inflation analyst', 'financial forecaster',
'macro analyst', 'market strategist', 'economic journalist',
'financial reporter', 'economic commentator', 'economic expert',
'economic professor', 'economic advisor', 'economic policymaker',
'economic director', 'economic specialist', 'economic writer',
'economic blogger', 'economic influencer', 'economic speaker',
'economic educator', 'economic scientist',
'banking analyst', 'investment analyst',
'financial economist', 'financial planner', 'financial researcher',
'economic strategist', 'business economist', 'economic modeler',
'policy economist', 'government economist', 'academic economist',
'data economist', 'international economist',
'fiscal policy analyst', 'financial commentator',
'investment strategist', 'investment manager',
'financial consultant', 'financial expert', 'market economist',
'economic policy analyst', 'economic risk analyst',
'economic data analyst', 'quantitative economist',
'financial market analyst', 'economic development specialist',
'economic planning analyst', 'economic growth analyst',
'monetary policy researcher', 'economic affairs director',
'economic indicators analyst', 'economic forecasting analyst',
'financial markets researcher', 'inflation expectations analyst',
'economic data researcher', 'economic performance analyst',
'economic trends analyst', 'economic news reporter',
'economic data journalist', 'inflation research specialist',
'economic policy advisor', 'macroeconomic trends analyst',
'financial markets strategist', 'economic modeling researcher',
'economic outlook commentator', 'economic trends commentator',
'economic report analyst', 'economic data expert',
'inflationary trends researcher', 'economic data forecaster',
'economic impact consultant', 'economic risk researcher',
'economic policy specialist', 'financial markets expert',
'economic commentary writer', 'macroeconomic research specialist',
'economic development economist', 'economic planning consultant',
'financial market trends analyst', 'economic indicators expert',
'economic forecasting specialist', 'economic analysis commentator',
'economic trends researcher', 'economic news journalist',
'financial markets commentator',
'inflation expectations researcher',
'economic data analysis specialist',
'economic performance researcher', 'economic policy researcher',
'economic modeling specialist', 'economic outlook advisor',
'economic trends specialist', 'economic report researcher',
'inflation analysis expert', 'macroeconomic trends specialist',
'monetary economist', 'market analyst', 'portfolio manager',
'asset manager', 'wealth manager', 'financial advisor',
'quantitative analyst', 'data analyst', 'financial modeler',
'risk analyst', 'credit analyst', 'forensic economist',
'behavioral economist', 'health economist', 'labor economist',
'environmental economist', 'energy economist',
'agricultural economist', 'development economist',
'public finance economist', 'fiscal economist',
'financial policy expert', 'economic historian',
'financial historian', 'economic sociologist',
'economic anthropologist', 'economic geographer',
'economic demographer', 'economic statistician',
'economic data scientist', 'financial data scientist',
'economic futurist', 'financial policy analyst',
'economic impact analyst', 'economic sustainability expert',
'economic inequality researcher', 'economic growth specialist',
'economic trade analyst', 'economic market researcher',
'economic valuation expert', 'economic regulation specialist',
'economic reform analyst', 'economic planning expert',
'economic globalization researcher',
'economic public relations expert',
'economic crisis management specialist',
'economic recovery strategist', 'economic ethics expert',
'economic behavioral scientist', 'economic cognitive psychologist',
'economic game theorist', 'economic neuroeconomist',
'economic decision scientist', 'economic social scientist',
'economic cultural anthropologist',
'economic organizational sociologist', 'economic urban geographer',
'economic rural demographer',
'economic environmental statistician',
'economic big data scientist', 'economic climate futurist',
'economic technology strategist', 'economic AI analyst',
'economic machine learning specialist',
'economic blockchain researcher', 'economic cryptocurrency expert',
'economic sustainable development specialist',
'economic circular economy analyst', 'economic fintech strategist',
'economic digital transformation consultant',
'economic ESG analyst', 'economic impact investing expert',
'economic green finance consultant',
'economic regenerative agriculture researcher',
'economic clean energy specialist',
'economic social enterprise expert',
'economic remote work researcher', 'economic gig economy analyst',
'economic supply chain strategist',
'economic logistics specialist', 'economic emerging market expert',
'economic startup advisor', 'economic venture capitalist',
'economic angel investor', 'economic crowdfunding specialist',
'economic real estate economist',
'economic property market analyst',
'economic housing policy expert',
'economic transportation economist',
'economic infrastructure specialist',
'economic public sector consultant',
'economic private sector advisor',
'economic non-profit researcher', 'economic healthcare economist',
'economic education economist', 'economic trade union strategist',
'economic financial inclusion expert',
'economic corporate governance analyst',
'economic taxation specialist', 'economic tax policy researcher',
'economic labor market economist', 'economic human capital expert',
'economic gender equality strategist',
'economic diversity and inclusion consultant',
'economic public health specialist',
'economic mental health economist',
'economic social welfare researcher',
'economic public policy analyst',
'economic sustainable business strategist',
'economic circular economy expert',
'economic innovation economist',
'economic entrepreneurship specialist',
'economic AI and automation researcher',
'economic digital economy analyst', 'economic trade policy expert',
'economic climate change economist',
'economic disaster recovery specialist',
'economic conflict resolution analyst',
'economic peacebuilding expert',
'economic renewable energy economist',
'economic climate finance strategist',
'economic sustainable agriculture researcher',
'economic natural resource specialist',
'economic circular economy consultant',
'economic smart cities analyst', 'economic urban planning expert',
'economic rural development strategist'
]
# Apply a function to determine if a user is professional
data_cleaned['is_inflationProfessional'] = data_cleaned['userBio'].apply(lambda bio: any(keyword in str(bio).lower() for keyword in inflation_professionals_keywords))
data_cleaned_prof = data_cleaned[data_cleaned['is_inflationProfessional']].copy(deep=True)
data_cleaned_inflation_prof = data_cleaned_prof[data_cleaned_prof.inflation_related].copy(deep=True)
# data_cleaned.to_csv('tweets_cleaned_original.csv', index=True)
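The bio classifier above marks a user as a professional if any keyword appears as a substring of the lowercased bio; missing bios arrive as NaN floats, which str() turns into 'nan', hence the str(bio) guard. A minimal sketch with an abbreviated keyword list (the bio text is invented for illustration):

```python
# Abbreviated keyword list for illustration; the notebook uses the full
# inflation_professionals_keywords list defined above.
keywords = ['economist', 'financial analyst']

def is_professional(bio):
    # Substring match on the lowercased bio; NaN safely becomes 'nan'
    return any(k in str(bio).lower() for k in keywords)

print(is_professional("Chief Economist at ExampleBank"))  # True
print(is_professional(float('nan')))                      # False
```

Because this is substring matching, broader titles such as 'macroeconomist' also match 'economist', which is usually the desired behaviour here.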
data_cleaned.groupby('date').count()['userNumTweets'].plot(figsize=(15,7))
plt.legend(['Frequency of Tweets'])
plt.xlabel('Day')
plt.ylabel('Number of Tweets')
plt.title('Number of Tweets per Day');
data_cleaned.groupby('Month').count()['userNumTweets'].plot(figsize=(15,7))
plt.legend(['Frequency of Tweets'])
plt.xlabel('Month')
plt.ylabel('Number of Tweets')
plt.title('Number of Tweets per Month');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.groupby('verb').count()['userNumTweets'], labels=['post', 'retweet'], autopct='%1.1f%%')
plt.title('Percentage of Posts Vs Retweets');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.groupby('userVerified').count()['userNumTweets'], labels=['No', 'Yes'], autopct='%1.1f%%')
plt.title('Percentage of Verified Users');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.region.value_counts().values,
labels=data_cleaned.region.value_counts().index,
autopct='%1.1f%%'
)
plt.title('Distribution of Tweets per Region');
plt.figure(figsize=(12,7))
attribute = data_cleaned.generator.value_counts()[:25]
plt.bar(range(len(attribute)), attribute.values)
plt.xticks(range(len(attribute)), attribute.index, rotation=90)
plt.legend(['Frequency of Generator'])
plt.title('Number of Tweets per Top 25 Generators')
plt.xlabel('Generator')
plt.ylabel('Number of Tweets');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.inflation_related.value_counts().values,
labels=['Not related to inflation', 'Related to inflation'],
autopct='%1.1f%%'
)
plt.title('Percentage of Tweets related to Inflation');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.is_inflationProfessional.value_counts().values,
labels=['Not Professional', 'Professional'],
autopct='%1.1f%%'
)
plt.title('Percentage of Tweets by Professionals');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned.sentiment.value_counts().values,
labels=data_cleaned.sentiment.value_counts().index,
autopct='%1.1f%%'
)
plt.title('Distribution of Sentiment for All Tweets');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned_inflation.sentiment.value_counts().values,
labels=data_cleaned_inflation.sentiment.value_counts().index,
autopct='%1.1f%%'
)
plt.title('Distribution of Sentiment for Tweets related to Inflation');
plt.figure(figsize=(7,7))
plt.pie(data_cleaned_inflation_prof.sentiment.value_counts().values,
labels=data_cleaned_inflation_prof.sentiment.value_counts().index,
autopct='%1.1f%%'
)
plt.title('Distribution of Sentiment for Tweets related to Inflation by Professionals');
def checkSentiment(x):
if x['sentiment'] == 'Positive':
return x['sentiment_score'], 0
if x['sentiment'] == 'Negative':
return 0, x['sentiment_score']
if x['sentiment'] == 'Neutral':
return 0, 0
def plotSentimentDaily(data, plt_title):
data.groupby('date').sum()['sentiment_score'].plot(figsize=(15,7), legend=True);
plt.ylabel('Sentiment Score')
plt.title(f'Sum of Compound Sentiment Score of {plt_title} per Day')
plt.show()
data.groupby('date').mean()['sentiment_score'].plot(figsize=(15,7), legend=True);
plt.ylabel('Sentiment Score')
plt.title(f'Mean of Compound Sentiment Score of {plt_title} per Day')
plt.show()
dummy = pd.DataFrame()
dummy['date'] = data['date']
dummy[['Positive', 'Negative']] = data.apply(lambda x: checkSentiment(x), axis=1).apply(pd.Series)
dummy.groupby('date').sum().plot(figsize=(15,7))
plt.ylabel('Sentiment Score')
plt.title(f'Sum of +Ve and -Ve Sentiment Scores of {plt_title} per Day')
plt.show()
dummy.groupby('date').mean().plot(figsize=(15,7))
plt.ylabel('Sentiment Score')
plt.title(f'Mean of +Ve and -Ve Sentiment Scores of {plt_title} per Day')
plt.show()
def plotSentimentMonthly(data, plt_title):
data.groupby('Month').sum()['sentiment_score'].plot(figsize=(15,7), legend=True);
plt.ylabel('Sentiment Score')
plt.title(f'Sum of Compound Sentiment Score of {plt_title} per Month')
plt.show()
data.groupby('Month').mean()['sentiment_score'].plot(figsize=(15,7), legend=True);
plt.ylabel('Sentiment Score')
plt.title(f'Mean of Compound Sentiment Score of {plt_title} per Month')
plt.show()
dummy = pd.DataFrame()
dummy['Month'] = data['Month']
dummy[['Positive', 'Negative']] = data.apply(lambda x: checkSentiment(x), axis=1).apply(pd.Series)
dummy.groupby('Month').sum().plot(figsize=(15,7))
plt.ylabel('Sentiment Score')
    plt.title(f'Sum of +Ve and -Ve Sentiment Scores of {plt_title} per Month')
plt.show()
dummy.groupby('Month').mean().plot(figsize=(15,7))
plt.ylabel('Sentiment Score')
plt.title(f'Mean of +Ve and -Ve Sentiment Scores of {plt_title} per Month')
plt.show()
plotSentimentDaily(data_cleaned_inflation, 'Inflation Related Tweets')
plotSentimentMonthly(data_cleaned_inflation, 'Inflation Related Tweets')
plotSentimentMonthly(data_cleaned_inflation_prof, 'Inflation Related Tweets by Professionals')
indicators_df = pd.DataFrame()
indicators_df['Month'] = data_cleaned.Month.unique()
indicators_df['Month'] = pd.to_datetime(indicators_df['Month'])
indicators_df = indicators_df.set_index('Month')
indicators_df['countOfAllTweets'] = data_cleaned.groupby('Month').count()['userNumTweets'].values
indicators_df['sumOfCompoundSentimentForAllTweets'] = data_cleaned.groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf+veSentimentForAllTweets'] = data_cleaned[data_cleaned.sentiment=='Positive'].groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf-veSentimentForAllTweets'] = data_cleaned[data_cleaned.sentiment=='Negative'].groupby('Month').sum()['sentiment_score'].values
indicators_df['meanOfCompoundSentimentForAllTweets'] = data_cleaned.groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf+veSentimentForAllTweets'] = data_cleaned[data_cleaned.sentiment=='Positive'].groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf-veSentimentForAllTweets'] = data_cleaned[data_cleaned.sentiment=='Negative'].groupby('Month').mean()['sentiment_score'].values
indicators_df['countOfInflationTweets'] = data_cleaned_inflation.groupby('Month').count()['sentiment'].values
indicators_df['sumOfCompoundSentimentForInflationTweets'] = data_cleaned_inflation.groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf+veSentimentForInflationTweets'] = data_cleaned_inflation[data_cleaned_inflation.sentiment == 'Positive'].groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf-veSentimentForInflationTweets'] = data_cleaned_inflation[data_cleaned_inflation.sentiment == 'Negative'].groupby('Month').sum()['sentiment_score'].values
indicators_df['meanOfCompoundSentimentForInflationTweets'] = data_cleaned_inflation.groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf+veSentimentForInflationTweets'] = data_cleaned_inflation[data_cleaned_inflation.sentiment == 'Positive'].groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf-veSentimentForInflationTweets'] = data_cleaned_inflation[data_cleaned_inflation.sentiment == 'Negative'].groupby('Month').mean()['sentiment_score'].values
indicators_df['countOfProfessionalsTweets'] = data_cleaned_prof.groupby('Month').count()['sentiment'].values
indicators_df['sumOfCompoundSentimentForProfessionalsTweets'] = data_cleaned_prof.groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf+veSentimentForProfessionalsTweets'] = data_cleaned_prof[data_cleaned_prof.sentiment == 'Positive'].groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf-veSentimentForProfessionalsTweets'] = data_cleaned_prof[data_cleaned_prof.sentiment == 'Negative'].groupby('Month').sum()['sentiment_score'].values
indicators_df['meanOfCompoundSentimentForProfessionalsTweets'] = data_cleaned_prof.groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf+veSentimentForProfessionalsTweets'] = data_cleaned_prof[data_cleaned_prof.sentiment == 'Positive'].groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf-veSentimentForProfessionalsTweets'] = data_cleaned_prof[data_cleaned_prof.sentiment == 'Negative'].groupby('Month').mean()['sentiment_score'].values
indicators_df['countOfProfessionalsInflationTweets'] = data_cleaned_inflation_prof.groupby('Month').count()['sentiment'].values
indicators_df['sumOfCompoundSentimentForProfessionalsInflationTweets'] = data_cleaned_inflation_prof.groupby('Month').sum()['sentiment_score'].values
indicators_df['sumOf+veSentimentForProfessionalsInflationTweets'] = data_cleaned_inflation_prof[data_cleaned_inflation_prof.sentiment == 'Positive'].groupby('Month').sum()['sentiment_score'].values
temp = data_cleaned_inflation_prof[data_cleaned_inflation_prof.sentiment == 'Negative'].groupby('Month').sum()['sentiment_score']
indicators_df.loc[temp.index, 'sumOf-veSentimentForProfessionalsInflationTweets'] = temp.values
indicators_df['meanOfCompoundSentimentForProfessionalsInflationTweets'] = data_cleaned_inflation_prof.groupby('Month').mean()['sentiment_score'].values
indicators_df['meanOf+veSentimentForProfessionalsInflationTweets'] = data_cleaned_inflation_prof[data_cleaned_inflation_prof.sentiment == 'Positive'].groupby('Month').mean()['sentiment_score'].values
temp = data_cleaned_inflation_prof[data_cleaned_inflation_prof.sentiment == 'Negative'].groupby('Month').mean()['sentiment_score']
indicators_df.loc[temp.index, 'meanOf-veSentimentForProfessionalsInflationTweets'] = temp.values
indicators_df = indicators_df.fillna(indicators_df.mean())
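Most of the blocks above assign grouped .values positionally, which silently assumes every subset has an observation in every month; the temp/.loc pattern used for the professionals' negative scores handles missing months by index instead. A reindex on the target months does the same thing in one step (toy data, invented for illustration):

```python
import pandas as pd

# Three target months, but the grouped series is missing February
months = pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01'])
scores = pd.Series([-1.5, -0.5],
                   index=pd.to_datetime(['2022-01-01', '2022-03-01']))

# reindex aligns by index and inserts NaN for the missing month,
# which fillna(indicators_df.mean()) can then impute, as above
aligned = scores.reindex(months)
print(aligned.isna().sum())  # 1
```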
from wordcloud import WordCloud
# Combine the processed tweets into a single string
combined_tweets = ' '.join(data_cleaned['processed_tweet'].astype(str))
# Generate a word cloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(combined_tweets)
# Display the generated wordcloud:
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Bag of Word representation of All Tweets')
plt.show()
# Combine the processed tweets into a single string
combined_tweets = ' '.join(data_cleaned_inflation['processed_tweet'].astype(str))
# Generate a word cloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(combined_tweets)
# Display the generated wordcloud:
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Bag of Word representation of Tweets about Inflation')
plt.show()
# Combine the processed tweets into a single string
combined_tweets = ' '.join(data_cleaned_prof['processed_tweet'].astype(str))
# Generate a word cloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(combined_tweets)
# Display the generated wordcloud:
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Bag of Word representation of All Tweets by Professionals')
plt.show()
# Combine the processed tweets into a single string
combined_tweets = ' '.join(data_cleaned_inflation_prof['processed_tweet'].astype(str))
# Generate a word cloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(combined_tweets)
# Display the generated wordcloud:
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Bag of Word representation of Tweets about Inflation by Professionals')
plt.show()
Reference UK CPI series (CSV files, series mnemonics kept as-is):
'CPHPTT01GBM659N.csv'
'GBRCPALTT01CTGYM.csv'
'GBRCPIALLMINMEI.csv'
'GBRCPIENGMINMEI.csv'
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
def get_top_n_words(corpus, n=None):
vec = CountVectorizer(stop_words='english').fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
return words_freq[:n]
common_words = get_top_n_words(data_cleaned['processed_tweet'] , 20)
unigram = pd.DataFrame(common_words, columns = ['unigram' , 'count'])
allTweetsTop15Unigrams = unigram[~unigram['unigram'].isin(['rt', 'amp'])]
allTweetsTop15Unigrams = allTweetsTop15Unigrams.reset_index(drop=True).iloc[:15].unigram.to_list()
unigram = unigram.iloc[:15]
plt.figure(figsize=(12,7))
plt.bar(x = unigram['unigram'], height = unigram['count'] )
plt.title('Frequency of top 15 Unigrams from All Tweets')
plt.ylabel('Counts')
plt.xlabel('Unigram');
common_words = get_top_n_words(data_cleaned_inflation['processed_tweet'] , 16)
unigram = pd.DataFrame(common_words, columns = ['unigram' , 'count'])
inflationTweetsTop15Unigrams = unigram[~unigram['unigram'].isin(['rt', 'amp'])]
inflationTweetsTop15Unigrams = inflationTweetsTop15Unigrams.reset_index(drop=True).iloc[:15].unigram.to_list()
plt.figure(figsize=(12,7))
plt.bar(x = unigram['unigram'], height = unigram['count'])
plt.title('Frequency of top 15 Unigrams from Inflation Tweets')
plt.ylabel('Frequency')
plt.xlabel('Unigram')
plt.xticks(rotation=30);
common_words = get_top_n_words(data_cleaned_inflation_prof['processed_tweet'], 16)
unigram = pd.DataFrame(common_words, columns = ['unigram' , 'count'])
inflationProfessionalTweetsTop15Unigrams = unigram[~unigram['unigram'].isin(['rt', 'amp'])]
inflationProfessionalTweetsTop15Unigrams = inflationProfessionalTweetsTop15Unigrams.reset_index(drop=True).iloc[:15].unigram.to_list()
plt.figure(figsize=(12,7))
plt.bar(x = unigram['unigram'], height = unigram['count'])
plt.title('Frequency of top 15 Unigrams from Inflation Tweets by Professionals')
plt.ylabel('Frequency')
plt.xlabel('Unigram')
plt.xticks(rotation=60);
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(data_cleaned['processed_tweet'] , 16)
bigram = pd.DataFrame(common_words, columns = ['bigram' , 'count'])
allTweetsTop15Bigrams = bigram.iloc[:15].bigram.to_list()
plt.figure(figsize=(12,7))
plt.bar(x = bigram['bigram'], height = bigram['count'] )
plt.title('Frequency of top 15 Bigrams from All Tweets')
plt.ylabel('Counts')
plt.xlabel('Bigram')
plt.xticks(rotation=30);
common_words = get_top_n_bigram(data_cleaned_inflation['processed_tweet'] , 16)
bigram = pd.DataFrame(common_words, columns = ['bigram' , 'count'])
inflationTweetsTop15Bigrams = bigram.iloc[:15].bigram.to_list()
plt.figure(figsize=(12,7))
plt.bar(x = bigram['bigram'], height = bigram['count'])
plt.title('Frequency of top 15 Bigrams from Inflation Tweets')
plt.ylabel('Frequency')
plt.xlabel('Bigram')
plt.xticks(rotation=30);
common_words = get_top_n_bigram(data_cleaned_inflation_prof['processed_tweet'], 16)
bigram = pd.DataFrame(common_words, columns = ['bigram' , 'count'])
inflationProfessionalTweetsTop15Bigrams = bigram.iloc[:15].bigram.to_list()
plt.figure(figsize=(12,7))
plt.bar(x = bigram['bigram'], height = bigram['count'])
plt.title('Frequency of top 15 Bigrams from Inflation Tweets by Professionals')
plt.ylabel('Frequency')
plt.xlabel('Bigram')
plt.xticks(rotation=30);
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
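`get_top_n_words`, `get_top_n_bigram`, and `get_top_n_trigram` differ only in the `ngram_range` passed to `CountVectorizer`, so they could be folded into a single helper; a sketch (not part of the original notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_ngrams(corpus, ngram_range=(1, 1), n=None):
    """Return the n most frequent n-grams in the corpus as (ngram, count) pairs."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total count of each vocabulary entry
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]
```

With `ngram_range=(1, 1)`, `(2, 2)`, and `(3, 3)` this reproduces the three functions above.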
common_words = get_top_n_trigram(data_cleaned['processed_tweet'], 100)
trigram = pd.DataFrame(common_words, columns = ['trigram' , 'count'])
listToRemove = [
'24hour trade volum',
'price predict sentiment',
'predict sentiment current',
'rt bori johnson',
'rt rishi sunak',
'pleas click link',
'click link origin',
'bell pleas click',
'link origin tweet',
'origin tweet join',
'rt went cinema',
'went cinema yest',
'cinema yest 845pm',
'yest 845pm 917',
'845pm 917 film',
'917 film actual',
'film actual start',
'actual start cinema',
'start cinema pay',
'cinema pay films',
'public sector worker',
'rt liz truss',
'rt woke morn',
'woke morn radio',
'morn radio talk',
'radio talk cost',
]
temp = trigram[~trigram['trigram'].isin(listToRemove)].reset_index(drop=True)
allTweetsTop15Trigrams = temp.iloc[:15].trigram.to_list()
trigram = trigram.iloc[:15]
plt.figure(figsize=(12,7))
plt.bar(x = trigram['trigram'], height = trigram['count'])
plt.title('Frequency of top 15 Trigrams from All Tweets')
plt.ylabel('Frequency')
plt.xlabel('Trigram')
plt.xticks(rotation=60);
common_words = get_top_n_trigram(data_cleaned_inflation['processed_tweet'], 100)
trigram = pd.DataFrame(common_words, columns = ['trigram' , 'count'])
listToRemove = [
'hour etn electroneum',
'etn electroneum cryptocurr',
'rt someth odd',
'someth odd afoot',
'odd afoot natur',
]
temp = trigram[~trigram['trigram'].isin(listToRemove)].reset_index(drop=True)
inflationTweetsTop15Trigrams = temp.iloc[:15].trigram.to_list()
trigram = trigram.iloc[:15]
plt.figure(figsize=(12,7))
plt.bar(x = trigram['trigram'], height = trigram['count'])
plt.title('Frequency of top 15 Trigrams from Inflation Tweets')
plt.ylabel('Frequency')
plt.xlabel('Trigram')
plt.xticks(rotation=60);
common_words = get_top_n_trigram(data_cleaned_inflation_prof['processed_tweet'], 16)
trigram = pd.DataFrame(common_words, columns = ['trigram' , 'count'])
temp = trigram[trigram['trigram'] != 'worri save pace'].reset_index(drop=True)
inflationProfessionalTweetsTop15Trigrams = temp.iloc[:15].trigram.to_list()
trigram = trigram.iloc[:15]
plt.figure(figsize=(12,7))
plt.bar(x = trigram['trigram'], height = trigram['count'])
plt.title('Frequency of top 15 Trigrams from Inflation Tweets by Professionals')
plt.ylabel('Frequency')
plt.xlabel('Trigram')
plt.xticks(rotation=60);
top15grams = [
allTweetsTop15Unigrams, allTweetsTop15Bigrams, allTweetsTop15Trigrams,
inflationTweetsTop15Unigrams, inflationTweetsTop15Bigrams, inflationTweetsTop15Trigrams,
inflationProfessionalTweetsTop15Unigrams, inflationProfessionalTweetsTop15Bigrams, inflationProfessionalTweetsTop15Trigrams
]
top15grams_cols = [
'allTweetsTop15Unigrams', 'allTweetsTop15Bigrams', 'allTweetsTop15Trigrams',
'inflationTweetsTop15Unigrams', 'inflationTweetsTop15Bigrams', 'inflationTweetsTop15Trigrams',
'inflationProfessionalTweetsTop15Unigrams', 'inflationProfessionalTweetsTop15Bigrams', 'inflationProfessionalTweetsTop15Trigrams'
]
for i in range(len(top15grams)):
    # Flag tweets containing ANY of the top-15 grams: OR-accumulate across grams.
    # (Plain assignment inside the loop would keep only the last gram's matches.)
    data_cleaned[top15grams_cols[i]] = False
    for gram in tqdm(top15grams[i]):
        data_cleaned[top15grams_cols[i]] |= data_cleaned.processed_tweet.str.contains(gram, regex=False, na=False)
data_cleaned.groupby('Month').sum().iloc[:,-9:].plot(figsize=(12,7))
plt.title('Monthly Frequency of Tweets that Include any of the top 15 Grams about Inflation');
temp = data_cleaned.groupby('Month').sum().iloc[:,-9:].reset_index()
temp['Month'] = pd.to_datetime(temp['Month'])
temp = temp.set_index('Month')
temp = temp.rename(columns={col: 'sumOf'+col for col in temp.columns})
indicators_df = pd.concat([indicators_df, temp], axis=1)
temp = data_cleaned.groupby('Month').mean().iloc[:,-9:].reset_index()
temp['Month'] = pd.to_datetime(temp['Month'])
temp = temp.set_index('Month')
temp = temp.rename(columns={col: 'meanOf'+col for col in temp.columns})
indicators_df = pd.concat([indicators_df, temp], axis=1)
data_cleaned.to_csv('tweets_cleaned_original.csv')
indicators_df.to_csv('inflation_indicators.csv')
((indicators_df-indicators_df.min())/(indicators_df.max()-indicators_df.min())).plot(figsize=(15, 7))
plt.title('46 Monthly Indicators from Tweets Data');
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-distilroberta-v1')
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
#Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
#Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
embedding.shape
(1, 768)
df_body_month = ['. '.join(group['body']) for date, group in data_cleaned.groupby('Month')]
df_inflation_body_month = ['. '.join(group['body']) for date, group in data_cleaned_inflation.groupby('Month')]
df_professionals_body_month = ['. '.join(group['body']) for date, group in data_cleaned_prof.groupby('Month')]
df_professionals_inflation_body_month = ['. '.join(group['body']) for date, group in data_cleaned_inflation_prof.groupby('Month')]
month_agg_bert = model.encode(df_body_month, batch_size = 250, show_progress_bar = True)
month_agg_bert_inflation = model.encode(df_inflation_body_month, batch_size = 250, show_progress_bar = True)
month_agg_bert_professionals = model.encode(df_professionals_body_month, batch_size = 250, show_progress_bar = True)
month_agg_bert_professionals_inflation = model.encode(df_professionals_inflation_body_month, batch_size = 250, show_progress_bar = True)
month_agg_bert.shape
(60, 768)
month_agg_bert_inflation.shape
(60, 768)
with open('./month_agg_bert.pkl', 'wb') as f:
    pickle.dump(month_agg_bert, f)
with open('./month_agg_bert_inflation.pkl', 'wb') as f:
    pickle.dump(month_agg_bert_inflation, f)
with open('./month_agg_bert_professionals.pkl', 'wb') as f:
    pickle.dump(month_agg_bert_professionals, f)
with open('./month_agg_bert_professionals_inflation.pkl', 'wb') as f:
    pickle.dump(month_agg_bert_professionals_inflation, f)
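The cached embeddings can be restored later with `pickle.load`. A self-contained round-trip sketch, where a dummy array stands in for the real `month_agg_bert` (60 months × 768 embedding dimensions):

```python
import os
import pickle
import tempfile
import numpy as np

# Dummy array with the same shape as the cached monthly embeddings
dummy = np.zeros((60, 768), dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), 'month_agg_bert_demo.pkl')

with open(path, 'wb') as f:
    pickle.dump(dummy, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored.shape)  # (60, 768)
```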
CPI_allItems = pd.read_csv('GBRCPIALLMINMEI.csv')
CPI_allItems = CPI_allItems.rename(columns={'GBRCPIALLMINMEI': 'CPI_allItems'})
CPI_allItems['DATE'] = pd.to_datetime(CPI_allItems['DATE'])
CPI_allItems = CPI_allItems.set_index('DATE')
CPI_allItems = CPI_allItems['2018-01-01': '2022-12-31']
CPI_allItems.plot(figsize=(12,7))
plt.title('Consumer Price Index of All Items in the United Kingdom (GBRCPIALLMINMEI)');
CPI_allItems
| DATE | CPI_allItems |
|---|---|
| 2018-01-01 | 104.5 |
| 2018-02-01 | 104.9 |
| 2018-03-01 | 105.1 |
| 2018-04-01 | 105.5 |
| 2018-05-01 | 105.9 |
| 2018-06-01 | 105.9 |
| 2018-07-01 | 105.9 |
| 2018-08-01 | 106.5 |
| 2018-09-01 | 106.6 |
| 2018-10-01 | 106.7 |
| 2018-11-01 | 106.9 |
| 2018-12-01 | 107.1 |
| 2019-01-01 | 106.4 |
| 2019-02-01 | 106.8 |
| 2019-03-01 | 107.0 |
| 2019-04-01 | 107.6 |
| 2019-05-01 | 107.9 |
| 2019-06-01 | 107.9 |
| 2019-07-01 | 108.0 |
| 2019-08-01 | 108.3 |
| 2019-09-01 | 108.4 |
| 2019-10-01 | 108.3 |
| 2019-11-01 | 108.5 |
| 2019-12-01 | 108.5 |
| 2020-01-01 | 108.3 |
| 2020-02-01 | 108.6 |
| 2020-03-01 | 108.6 |
| 2020-04-01 | 108.6 |
| 2020-05-01 | 108.6 |
| 2020-06-01 | 108.8 |
| 2020-07-01 | 109.2 |
| 2020-08-01 | 108.8 |
| 2020-09-01 | 109.2 |
| 2020-10-01 | 109.2 |
| 2020-11-01 | 109.1 |
| 2020-12-01 | 109.4 |
| 2021-01-01 | 109.3 |
| 2021-02-01 | 109.4 |
| 2021-03-01 | 109.7 |
| 2021-04-01 | 110.4 |
| 2021-05-01 | 111.0 |
| 2021-06-01 | 111.4 |
| 2021-07-01 | 111.4 |
| 2021-08-01 | 112.1 |
| 2021-09-01 | 112.4 |
| 2021-10-01 | 113.4 |
| 2021-11-01 | 114.1 |
| 2021-12-01 | 114.7 |
| 2022-01-01 | 114.6 |
| 2022-02-01 | 115.4 |
| 2022-03-01 | 116.5 |
| 2022-04-01 | 119.0 |
| 2022-05-01 | 119.7 |
| 2022-06-01 | 120.5 |
| 2022-07-01 | 121.2 |
| 2022-08-01 | 121.8 |
| 2022-09-01 | 122.3 |
| 2022-10-01 | 124.3 |
| 2022-11-01 | 124.8 |
| 2022-12-01 | 125.3 |
CPIH_annualRate = pd.read_csv('GBRCPALTT01CTGYM.csv')
CPIH_annualRate = CPIH_annualRate.rename(columns={'GBRCPALTT01CTGYM': 'CPIH_annualRate'})
CPIH_annualRate['DATE'] = pd.to_datetime(CPIH_annualRate['DATE'])
CPIH_annualRate = CPIH_annualRate.set_index('DATE')
CPIH_annualRate = CPIH_annualRate['2018-01-01': '2022-12-31']
CPIH_annualRate.plot(figsize=(12,7))
plt.title('Consumer Price Index: All items: Total: Total for the United Kingdom (GBRCPALTT01CTGYM)');
CPIH_annualRate
| DATE | CPIH_annualRate |
|---|---|
| 2018-01-01 | 2.71 |
| 2018-02-01 | 2.45 |
| 2018-03-01 | 2.29 |
| 2018-04-01 | 2.20 |
| 2018-05-01 | 2.30 |
| 2018-06-01 | 2.30 |
| 2018-07-01 | 2.29 |
| 2018-08-01 | 2.39 |
| 2018-09-01 | 2.20 |
| 2018-10-01 | 2.23 |
| 2018-11-01 | 2.15 |
| 2018-12-01 | 2.01 |
| 2019-01-01 | 1.77 |
| 2019-02-01 | 1.81 |
| 2019-03-01 | 1.84 |
| 2019-04-01 | 2.00 |
| 2019-05-01 | 1.96 |
| 2019-06-01 | 1.94 |
| 2019-07-01 | 1.99 |
| 2019-08-01 | 1.70 |
| 2019-09-01 | 1.69 |
| 2019-10-01 | 1.50 |
| 2019-11-01 | 1.50 |
| 2019-12-01 | 1.37 |
| 2020-01-01 | 1.75 |
| 2020-02-01 | 1.73 |
| 2020-03-01 | 1.55 |
| 2020-04-01 | 0.92 |
| 2020-05-01 | 0.68 |
| 2020-06-01 | 0.80 |
| 2020-07-01 | 1.15 |
| 2020-08-01 | 0.51 |
| 2020-09-01 | 0.75 |
| 2020-10-01 | 0.89 |
| 2020-11-01 | 0.58 |
| 2020-12-01 | 0.82 |
| 2021-01-01 | 0.93 |
| 2021-02-01 | 0.73 |
| 2021-03-01 | 0.96 |
| 2021-04-01 | 1.64 |
| 2021-05-01 | 2.12 |
| 2021-06-01 | 2.45 |
| 2021-07-01 | 2.09 |
| 2021-08-01 | 3.03 |
| 2021-09-01 | 2.92 |
| 2021-10-01 | 3.83 |
| 2021-11-01 | 4.58 |
| 2021-12-01 | 4.84 |
| 2022-01-01 | 4.90 |
| 2022-02-01 | 5.48 |
| 2022-03-01 | 6.22 |
| 2022-04-01 | 7.78 |
| 2022-05-01 | 7.87 |
| 2022-06-01 | 8.18 |
| 2022-07-01 | 8.76 |
| 2022-08-01 | 8.61 |
| 2022-09-01 | 8.81 |
| 2022-10-01 | 9.59 |
| 2022-11-01 | 9.35 |
| 2022-12-01 | 9.24 |
CPI_harmonizedPrices = pd.read_csv('CPHPTT01GBM659N.csv')
CPI_harmonizedPrices = CPI_harmonizedPrices.rename(columns={'CPHPTT01GBM659N': 'CPI_harmonizedPrices'})
CPI_harmonizedPrices['DATE'] = pd.to_datetime(CPI_harmonizedPrices['DATE'])
CPI_harmonizedPrices = CPI_harmonizedPrices.set_index('DATE')
CPI_harmonizedPrices = CPI_harmonizedPrices['2018-01-01': '2022-12-31']
CPI_harmonizedPrices.plot(figsize=(12,7))
plt.title('Consumer Price Index: Harmonized Prices: Total All Items for the United Kingdom (CPHPTT01GBM659N)');
CPI_harmonizedPrices
| DATE | CPI_harmonizedPrices |
|---|---|
| 2018-01-01 | 3.0 |
| 2018-02-01 | 2.7 |
| 2018-03-01 | 2.5 |
| 2018-04-01 | 2.4 |
| 2018-05-01 | 2.4 |
| 2018-06-01 | 2.4 |
| 2018-07-01 | 2.5 |
| 2018-08-01 | 2.7 |
| 2018-09-01 | 2.4 |
| 2018-10-01 | 2.4 |
| 2018-11-01 | 2.3 |
| 2018-12-01 | 2.1 |
| 2019-01-01 | 1.8 |
| 2019-02-01 | 1.9 |
| 2019-03-01 | 1.9 |
| 2019-04-01 | 2.1 |
| 2019-05-01 | 2.0 |
| 2019-06-01 | 2.0 |
| 2019-07-01 | 2.1 |
| 2019-08-01 | 1.7 |
| 2019-09-01 | 1.7 |
| 2019-10-01 | 1.5 |
| 2019-11-01 | 1.5 |
| 2019-12-01 | 1.3 |
| 2020-01-01 | 1.8 |
| 2020-02-01 | 1.7 |
| 2020-03-01 | 1.5 |
| 2020-04-01 | 0.8 |
| 2020-05-01 | 0.5 |
| 2020-06-01 | 0.6 |
| 2020-07-01 | 1.0 |
| 2020-08-01 | 0.2 |
| 2020-09-01 | 0.5 |
| 2020-10-01 | 0.7 |
| 2020-11-01 | 0.3 |
| 2020-12-01 | 0.6 |
| 2021-01-01 | 0.7 |
| 2021-02-01 | 0.4 |
| 2021-03-01 | 0.7 |
| 2021-04-01 | 1.5 |
| 2021-05-01 | 2.1 |
| 2021-06-01 | 2.5 |
| 2021-07-01 | 2.0 |
| 2021-08-01 | 3.2 |
| 2021-09-01 | 3.1 |
| 2021-10-01 | 4.2 |
| 2021-11-01 | 5.1 |
| 2021-12-01 | 5.4 |
| 2022-01-01 | 5.5 |
| 2022-02-01 | 6.2 |
| 2022-03-01 | 7.0 |
| 2022-04-01 | 9.0 |
| 2022-05-01 | 9.1 |
| 2022-06-01 | 9.4 |
| 2022-07-01 | 10.1 |
| 2022-08-01 | 9.9 |
| 2022-09-01 | 10.1 |
| 2022-10-01 | 11.1 |
| 2022-11-01 | 10.7 |
| 2022-12-01 | 10.5 |
CPI_energy = pd.read_csv('GBRCPIENGMINMEI.csv')
CPI_energy = CPI_energy.rename(columns={'GBRCPIENGMINMEI': 'CPI_energy'})
CPI_energy['DATE'] = pd.to_datetime(CPI_energy['DATE'])
CPI_energy = CPI_energy.set_index('DATE')
CPI_energy = CPI_energy['2018-01-01': '2022-12-31']
CPI_energy.plot(figsize=(12,7))
plt.title('Consumer Price Index: Energy for United Kingdom (GBRCPIENGMINMEI)');
for i in [CPI_allItems, CPIH_annualRate, CPI_energy]:
    display(pd.concat([indicators_df, i], axis=1).corr()[i.columns[0]])
    print('*' * 100)
countOfAllTweets                                          0.895258
sumOfCompoundSentimentForAllTweets                        0.858199
sumOf+veSentimentForAllTweets                             0.891288
sumOf-veSentimentForAllTweets                            -0.874366
meanOfCompoundSentimentForAllTweets                      -0.681778
meanOf+veSentimentForAllTweets                           -0.624231
meanOf-veSentimentForAllTweets                           -0.309846
countOfInflationTweets                                    0.927208
sumOfCompoundSentimentForInflationTweets                  0.806633
sumOf+veSentimentForInflationTweets                       0.924151
sumOf-veSentimentForInflationTweets                      -0.902778
meanOfCompoundSentimentForInflationTweets                -0.283025
meanOf+veSentimentForInflationTweets                      0.080258
meanOf-veSentimentForInflationTweets                     -0.185144
countOfProfessionalsTweets                                0.905069
sumOfCompoundSentimentForProfessionalsTweets              0.789483
sumOf+veSentimentForProfessionalsTweets                   0.872087
sumOf-veSentimentForProfessionalsTweets                  -0.849522
meanOfCompoundSentimentForProfessionalsTweets            -0.056127
meanOf+veSentimentForProfessionalsTweets                 -0.063096
meanOf-veSentimentForProfessionalsTweets                  0.043176
countOfProfessionalsInflationTweets                       0.888286
sumOfCompoundSentimentForProfessionalsInflationTweets     0.595754
sumOf+veSentimentForProfessionalsInflationTweets          0.783913
sumOf-veSentimentForProfessionalsInflationTweets         -0.769595
meanOfCompoundSentimentForProfessionalsInflationTweets   -0.217012
meanOf+veSentimentForProfessionalsInflationTweets        -0.123296
meanOf-veSentimentForProfessionalsInflationTweets        -0.063830
sumOfallTweetsTop15Unigrams                               0.886564
sumOfallTweetsTop15Bigrams                                0.776965
sumOfallTweetsTop15Trigrams                               0.264379
sumOfinflationTweetsTop15Unigrams                         0.836616
sumOfinflationTweetsTop15Bigrams                          0.827214
sumOfinflationTweetsTop15Trigrams                         0.471364
sumOfinflationProfessionalTweetsTop15Unigrams             0.833177
sumOfinflationProfessionalTweetsTop15Bigrams              0.388015
sumOfinflationProfessionalTweetsTop15Trigrams             0.084913
meanOfallTweetsTop15Unigrams                             -0.186367
meanOfallTweetsTop15Bigrams                               0.505589
meanOfallTweetsTop15Trigrams                             -0.262998
meanOfinflationTweetsTop15Unigrams                       -0.346028
meanOfinflationTweetsTop15Bigrams                         0.837462
meanOfinflationTweetsTop15Trigrams                       -0.355777
meanOfinflationProfessionalTweetsTop15Unigrams            0.757799
meanOfinflationProfessionalTweetsTop15Bigrams            -0.383295
meanOfinflationProfessionalTweetsTop15Trigrams           -0.222807
CPI_allItems                                              1.000000
Name: CPI_allItems, dtype: float64
****************************************************************************************************
countOfAllTweets                                          0.893800
sumOfCompoundSentimentForAllTweets                        0.855161
sumOf+veSentimentForAllTweets                             0.890000
sumOf-veSentimentForAllTweets                            -0.874672
meanOfCompoundSentimentForAllTweets                      -0.621749
meanOf+veSentimentForAllTweets                           -0.487280
meanOf-veSentimentForAllTweets                           -0.331327
countOfInflationTweets                                    0.934124
sumOfCompoundSentimentForInflationTweets                  0.739366
sumOf+veSentimentForInflationTweets                       0.896403
sumOf-veSentimentForInflationTweets                      -0.921260
meanOfCompoundSentimentForInflationTweets                -0.484374
meanOf+veSentimentForInflationTweets                     -0.009522
meanOf-veSentimentForInflationTweets                     -0.141995
countOfProfessionalsTweets                                0.886528
sumOfCompoundSentimentForProfessionalsTweets              0.782725
sumOf+veSentimentForProfessionalsTweets                   0.864849
sumOf-veSentimentForProfessionalsTweets                  -0.842794
meanOfCompoundSentimentForProfessionalsTweets            -0.039461
meanOf+veSentimentForProfessionalsTweets                 -0.035611
meanOf-veSentimentForProfessionalsTweets                 -0.017701
countOfProfessionalsInflationTweets                       0.876029
sumOfCompoundSentimentForProfessionalsInflationTweets     0.592584
sumOf+veSentimentForProfessionalsInflationTweets          0.787103
sumOf-veSentimentForProfessionalsInflationTweets         -0.794449
meanOfCompoundSentimentForProfessionalsInflationTweets   -0.199326
meanOf+veSentimentForProfessionalsInflationTweets        -0.104311
meanOf-veSentimentForProfessionalsInflationTweets        -0.135369
sumOfallTweetsTop15Unigrams                               0.868673
sumOfallTweetsTop15Bigrams                                0.795379
sumOfallTweetsTop15Trigrams                               0.132367
sumOfinflationTweetsTop15Unigrams                         0.851980
sumOfinflationTweetsTop15Bigrams                          0.853064
sumOfinflationTweetsTop15Trigrams                         0.476883
sumOfinflationProfessionalTweetsTop15Unigrams             0.845270
sumOfinflationProfessionalTweetsTop15Bigrams              0.372484
sumOfinflationProfessionalTweetsTop15Trigrams             0.200374
meanOfallTweetsTop15Unigrams                             -0.215128
meanOfallTweetsTop15Bigrams                               0.496390
meanOfallTweetsTop15Trigrams                             -0.376733
meanOfinflationTweetsTop15Unigrams                       -0.204591
meanOfinflationTweetsTop15Bigrams                         0.866670
meanOfinflationTweetsTop15Trigrams                       -0.234091
meanOfinflationProfessionalTweetsTop15Unigrams            0.712579
meanOfinflationProfessionalTweetsTop15Bigrams            -0.328568
meanOfinflationProfessionalTweetsTop15Trigrams           -0.051652
CPIH_annualRate                                           1.000000
Name: CPIH_annualRate, dtype: float64
****************************************************************************************************
countOfAllTweets                                          0.862237
sumOfCompoundSentimentForAllTweets                        0.832381
sumOf+veSentimentForAllTweets                             0.860334
sumOf-veSentimentForAllTweets                            -0.840518
meanOfCompoundSentimentForAllTweets                      -0.598448
meanOf+veSentimentForAllTweets                           -0.478269
meanOf-veSentimentForAllTweets                           -0.322546
countOfInflationTweets                                    0.921756
sumOfCompoundSentimentForInflationTweets                  0.756066
sumOf+veSentimentForInflationTweets                       0.891156
sumOf-veSentimentForInflationTweets                      -0.893598
meanOfCompoundSentimentForInflationTweets                -0.407209
meanOf+veSentimentForInflationTweets                      0.062480
meanOf-veSentimentForInflationTweets                     -0.127596
countOfProfessionalsTweets                                0.832976
sumOfCompoundSentimentForProfessionalsTweets              0.710700
sumOf+veSentimentForProfessionalsTweets                   0.804075
sumOf-veSentimentForProfessionalsTweets                  -0.810095
meanOfCompoundSentimentForProfessionalsTweets            -0.076744
meanOf+veSentimentForProfessionalsTweets                 -0.057291
meanOf-veSentimentForProfessionalsTweets                  0.005720
countOfProfessionalsInflationTweets                       0.835153
sumOfCompoundSentimentForProfessionalsInflationTweets     0.554669
sumOf+veSentimentForProfessionalsInflationTweets          0.757084
sumOf-veSentimentForProfessionalsInflationTweets         -0.784205
meanOfCompoundSentimentForProfessionalsInflationTweets   -0.202995
meanOf+veSentimentForProfessionalsInflationTweets        -0.105320
meanOf-veSentimentForProfessionalsInflationTweets        -0.118329
sumOfallTweetsTop15Unigrams                               0.821367
sumOfallTweetsTop15Bigrams                                0.777254
sumOfallTweetsTop15Trigrams                               0.161201
sumOfinflationTweetsTop15Unigrams                         0.801764
sumOfinflationTweetsTop15Bigrams                          0.862424
sumOfinflationTweetsTop15Trigrams                         0.519073
sumOfinflationProfessionalTweetsTop15Unigrams             0.832104
sumOfinflationProfessionalTweetsTop15Bigrams              0.441283
sumOfinflationProfessionalTweetsTop15Trigrams             0.194833
meanOfallTweetsTop15Unigrams                             -0.256574
meanOfallTweetsTop15Bigrams                               0.484365
meanOfallTweetsTop15Trigrams                             -0.340015
meanOfinflationTweetsTop15Unigrams                       -0.243403
meanOfinflationTweetsTop15Bigrams                         0.878194
meanOfinflationTweetsTop15Trigrams                       -0.228498
meanOfinflationProfessionalTweetsTop15Unigrams            0.693651
meanOfinflationProfessionalTweetsTop15Bigrams            -0.273073
meanOfinflationProfessionalTweetsTop15Trigrams           -0.078649
CPI_energy                                                1.000000
Name: CPI_energy, dtype: float64
****************************************************************************************************
inflation_train = CPIH_annualRate[:'2021-12-01']
inflation_test = CPIH_annualRate['2022-01-01':]
inflation_train.shape, inflation_test.shape
((48, 1), (12, 1))
indicators_df = pd.read_csv('inflation_indicators.csv')
indicators_df['Month'] = pd.to_datetime(indicators_df['Month'])
indicators_df = indicators_df.set_index('Month')
indicators_train = indicators_df[:'2021-12-01']
indicators_test = indicators_df['2022-01-01':]
indicators_train.shape, indicators_test.shape
((48, 46), (12, 46))
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
result = adfuller(CPIH_annualRate['CPIH_annualRate'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
ADF Statistic: -4.625624
p-value: 0.000116
The p-value is far below 0.05, so the null hypothesis of a unit root is rejected; by this test the CPIH annual-rate series can be treated as stationary over the sample.
from statsmodels.tsa.seasonal import seasonal_decompose
import itertools
import statsmodels.api as sm
decomposition = seasonal_decompose(CPIH_annualRate, model='additive')
fig = decomposition.plot()
plt.show()
ARIMA stands for AutoRegressive Integrated Moving Average.
ARIMA models are denoted ARIMA(p, d, q), where
p is the order of the AR (autoregressive) term,
d is the degree of differencing required to make the time series stationary,
q is the order of the MA (moving-average) term.
The seasonal extension (SARIMA) adds a seasonal order (P, D, Q, s), allowing the model to capture seasonality in addition to trend and noise.
# Create p, d, q combinations.
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)
pdq_x_seasonal_pdq = []
aic_ = []
for param in pdq:
    for param_seasonal in seasonal_pdq:
        mod = sm.tsa.statespace.SARIMAX(inflation_train,
                                        order=param,
                                        seasonal_order=param_seasonal,
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)
        results = mod.fit()
        print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        pdq_x_seasonal_pdq.append('ARIMA{}x{}12'.format(param, param_seasonal))
        aic_.append(results.aic)
(Repeated statsmodels warnings omitted: a ValueWarning "No frequency information was provided, so inferred frequency MS will be used" for every fit, occasional ConvergenceWarning "Maximum Likelihood optimization failed to converge", and a UserWarning about too few observations to estimate starting parameters for the seasonal ARMA.)
ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:204.50976379309475
ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1409.4496884975517
ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:126.51928992417376
ARIMA(0, 0, 0)x(0, 1, 1, 12)12 - AIC:87.31859037068914
ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:129.78278058473455
ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:1653.4502309754685
ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:91.73401170532784
ARIMA(0, 0, 0)x(1, 1, 1, 12)12 - AIC:86.37957603888844
ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:151.53637401761267
ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:1446.8427036816447
ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:98.64993163134797
ARIMA(0, 0, 1)x(0, 1, 1, 12)12 - AIC:68.74266721035747
ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:103.42515175347589
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning)
ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:1443.0215346760033 ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:74.47030961807272 ARIMA(0, 0, 1)x(1, 1, 1, 12)12 - AIC:66.46218576493284 ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:31.76059478424028 ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1417.753293088088 ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:55.08605819888462
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. 
Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(0, 1, 0)x(0, 1, 1, 12)12 - AIC:38.6323561251205 ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:26.55223638113195 ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1432.18062946169 ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:38.44494150070333 ARIMA(0, 1, 0)x(1, 1, 1, 12)12 - AIC:36.638457890058575 ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:33.464581210394314
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:1376.8388238065734 ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:56.38061385688948 ARIMA(0, 1, 1)x(0, 1, 1, 12)12 - AIC:39.90840332460787 ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:27.912646573735277 ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:1606.4968949196268 ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:40.39973650998163 ARIMA(0, 1, 1)x(1, 1, 1, 12)12 - AIC:37.42531888625585 ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:32.575330359161555 ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1527.947337791632 ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:57.434902984261804 ARIMA(1, 0, 0)x(0, 1, 1, 12)12 - AIC:40.57650521009774 ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:27.557834498175676
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:1523.9723432379396 ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:39.9566628340534 ARIMA(1, 0, 0)x(1, 1, 1, 12)12 - AIC:39.22768188899077 ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:33.78641279375394 ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:1393.1900960202647
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. 
Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:58.31815969337335 ARIMA(1, 0, 1)x(0, 1, 1, 12)12 - AIC:42.11001608314108 ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:29.17903041294758 ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:1389.3843306328656
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:41.95285195032536 ARIMA(1, 0, 1)x(1, 1, 1, 12)12 - AIC:39.919311345338045 ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:33.08117076425867 ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1346.5282265824144 ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:57.07598972161245 ARIMA(1, 1, 0)x(0, 1, 1, 12)12 - AIC:40.48024032907412
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:27.756117678109284 ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1583.0849794830704 ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:39.566263624835244 ARIMA(1, 1, 0)x(1, 1, 1, 12)12 - AIC:37.45277065672283 ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:31.579604298115694
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. 
self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:1144.3942437214494 ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:57.940916183397576 ARIMA(1, 1, 1)x(0, 1, 1, 12)12 - AIC:38.093110974475614 ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:26.197315695141604 ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:1374.0520592532214 ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:40.775563343371005 ARIMA(1, 1, 1)x(1, 1, 1, 12)12 - AIC:33.246439742522355
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\base\model.py:606: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals ConvergenceWarning) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq) C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used. self._init_dates(dates, freq)
min(aic_)
26.197315695141604
least_aic_position = aic_.index(min(aic_))
pdq_x_seasonal_pdq[least_aic_position]
'ARIMA(1, 1, 1)x(1, 0, 0, 12)12'
mod = sm.tsa.statespace.SARIMAX(inflation_train,
                                order=(1, 1, 1),
                                seasonal_order=(1, 0, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.9922 0.172 5.758 0.000 0.654 1.330
ma.L1 -0.7738 0.231 -3.355 0.001 -1.226 -0.322
ar.S.L12 -0.7311 0.243 -3.004 0.003 -1.208 -0.254
sigma2 0.0973 0.029 3.340 0.001 0.040 0.154
==============================================================================
results.plot_diagnostics(figsize=(16, 8))
plt.show()
The model residuals are approximately normally distributed and show no obvious remaining structure, suggesting the model is an adequate fit.
pred = results.get_prediction(start=pd.to_datetime('2022-01-01'),
                              end=pd.to_datetime('2022-12-01'))
pred_ci = pred.conf_int()
ax = CPIH_annualRate.plot(label='Ground Truth')
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=.7, figsize=(14, 7), color='green')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.vlines(inflation_test.index[0], 0, 10, colors='r', linestyles='dashed')
ax.set_xlabel('Date')
ax.set_ylabel('CPIH annual rate (%)')
plt.legend()
plt.show()
monthly_data_forecasted = pred.predicted_mean
monthly_data_truth = inflation_test.CPIH_annualRate
mse = ((monthly_data_forecasted - monthly_data_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 4.67
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
The Root Mean Squared Error of our forecasts is 2.16
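The same evaluation can be packaged into a reusable helper that also reports the mean absolute error; a minimal sketch (the function name is illustrative, not from the notebook), usable as `forecast_errors(monthly_data_truth, monthly_data_forecasted)`:

```python
import numpy as np


def forecast_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAE) for aligned truth/forecast sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    return mse, float(np.sqrt(mse)), float(np.mean(np.abs(err)))
```

RMSE is on the same scale as the series (percentage points of CPIH here), which makes it easier to interpret than MSE.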
A multivariate ARIMA model, also known as a Vector Autoregressive Moving-Average (VARMA) model, extends the univariate ARIMA to several time-dependent variables modelled jointly. The variables are treated symmetrically in a VARMA model: each is a linear function of its own past lags and the past lags of every other variable.
Here are the steps to apply a VARMA model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.varmax import VARMAX
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv('data.csv')
def adf_test(series):
    # Augmented Dickey-Fuller test: p-value <= 0.05 rejects the unit-root null.
    result = adfuller(series, autolag='AIC')
    status = 'Stationary' if result[1] <= 0.05 else 'Non-stationary'
    print(f'{series.name}: {status} (p-value = {result[1]:.3f})')

for col in df.columns:
    adf_test(df[col])
# Log transformation and differencing
df = np.log(df).diff().dropna()
# Split the data into training and test sets.
train_size = int(0.8 * len(df))
train, test = df[:train_size], df[train_size:]
model = VARMAX(train, order=(1, 1))
model_fit = model.fit(disp=False)
# Forecast the test period and evaluate the model's performance.
predictions = model_fit.forecast(steps=len(test))
# plot the predictions and actual values
plt.figure(figsize=(12, 6))
for i, col in enumerate(test.columns):
plt.subplot(len(test.columns), 1, i+1)
plt.plot(predictions.index, predictions[col], color='red', label='Predicted')
plt.plot(test.index, test[col], color='blue', label='Actual')
plt.title('Forecast vs Actuals for ' + col)
plt.legend(loc='upper left', fontsize=8)
plt.show()
Note that the order (p, q) is usually chosen by examining the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, or by using automatic selection criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
It is also worth noting that multivariate forecasting is not always better than univariate forecasting; it depends on the dataset and problem at hand. A univariate model per feature, with the other features supplied as exogenous variables, may perform better. More complex machine learning and deep learning methods become an option when several features are available.
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[['countOfAllTweets']]], axis=1)
threshold = 0.85
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
14
highly_correlated_signals
['countOfAllTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForAllTweets', 'sumOf-veSentimentForAllTweets', 'countOfInflationTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'countOfProfessionalsTweets', 'sumOf+veSentimentForProfessionalsTweets', 'countOfProfessionalsInflationTweets', 'sumOfallTweetsTop15Unigrams', 'sumOfinflationTweetsTop15Unigrams', 'sumOfinflationTweetsTop15Bigrams', 'meanOfinflationTweetsTop15Bigrams']
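The threshold filter above can also be written with `DataFrame.corrwith` and boolean indexing, avoiding the lambda-and-dropna dance. A self-contained sketch on synthetic data (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 48
target = pd.Series(np.cumsum(rng.normal(0.2, 1, n)), name='CPIH_annualRate')
indicators = pd.DataFrame({
    'strong_pos': target + rng.normal(0, 0.5, n),   # highly correlated
    'strong_neg': -target + rng.normal(0, 0.5, n),  # highly anti-correlated
    'noise': pd.Series(rng.normal(0, 1, n)),        # unrelated
})

threshold = 0.85
corr = indicators.corrwith(target)  # correlation of each column with the target
selected = corr[corr.abs() >= threshold].index.tolist()
```

Because the target is never concatenated into the indicator frame, there is no need to slice it back out of the result.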
indicators_df.drop(columns=['countOfAllTweets', 'sumOf+veSentimentForAllTweets', 'countOfInflationTweets', 'sumOfCompoundSentimentForAllTweets']).plot(figsize=(15,7))
<AxesSubplot:xlabel='Month'>
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOfCompoundSentimentForAllTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Create the VAR model
model = sm.tsa.VAR(train_df)
# Define the lag order (can be chosen based on information criteria or domain knowledge)
lag_order = 3
# Fit the VAR model
results = model.fit(lag_order)
# Print the summary of the VAR model
# print(results.summary())
# Forecasting: Provide past values of all features for forecasting the target variable
# Replace the below values with your actual data for forecasting
# past_values = np.array([[18, 12, 13], [20, 15, 16]])
# Forecasted values for the target variable:
# [[23.14325069 17.87603306 19.00275482]]
def displayForecastAndError(forecasted, forecast_interval, true, lag_order, n_signals, returnNum=False):
MSE, MAE = calculateMSE_MAE(forecasted, true)
if returnNum:
return MSE, MAE
print(f'For {n_signals}# Signals and for Lag Order = {lag_order}')
print(f'MSE: {MSE:.2f} \t MAE: {MAE:.2f}')
ax = CPIH_annualRate.plot(label='Ground Truth')
pd.Series(forecasted, index=test_df.index).plot(ax=ax, label='Forecast', alpha=.7, figsize=(14, 7), color='green')
ax.fill_between(test_df.index,
forecast_interval[1][:,0],
forecast_interval[2][:,0], color='k', alpha=.2)
ax.vlines(test_df.index[0], 0, 10, colors='r', linestyles='dashed')
ax.set_xlabel('Date')
ax.set_ylabel('CPI')
plt.legend()
plt.show()
def calculateMSE_MAE(forecasted, true):
MSE, MAE = np.mean((forecasted - true)**2), np.mean(np.abs(forecasted - true))
return MSE, MAE
def testVARFitting(train_df, test_df, returnNum=False, specifyLag=None):
# Create the VAR model
model = sm.tsa.VAR(train_df)
lowestMSE = np.inf
lagOfLowestMSE = None
# Define the lag order (can be chosen based on information criteria or domain knowledge)
for lag_order in range(1, 6):
if specifyLag is not None: lag_order=specifyLag
# Fit the VAR model
results = model.fit(lag_order)
n_steps = 12
past_values = train_df.iloc[-lag_order:].values
# Make predictions
forecast = results.forecast(y=past_values, steps=n_steps)
forecast_interval = results.forecast_interval(y=past_values, steps=n_steps, alpha=0.05)
if returnNum:
MSE, MAE = displayForecastAndError(forecast[:, 0], forecast_interval, test_df['CPIH_annualRate'].values, lag_order, forecast.shape[1]-1, returnNum)
if MSE < lowestMSE:
lowestMSE = MSE
lagOfLowestMSE = lag_order
else:
displayForecastAndError(forecast[:, 0], forecast_interval, test_df['CPIH_annualRate'].values, lag_order, forecast.shape[1]-1, returnNum)
print('-'*100)
if specifyLag is not None: break
if returnNum:
return lowestMSE, lagOfLowestMSE
n_steps = 12
past_values = train_df.iloc[-lag_order:].values
# Make predictions
forecast = results.forecast(y=past_values, steps=n_steps)
forecast_interval = results.forecast_interval(y=past_values, steps=n_steps, alpha=0.05)
# Print the forecasted values for the target variable
# print("Forecasted values for the target variable:")
# print(forecast)
ax = CPIH_annualRate.plot(label='Ground Truth')
pd.Series(forecast[:,0], index=test_df.index).plot(ax=ax, label='Forecast', alpha=.7, figsize=(14, 7), color='green')
ax.fill_between(test_df.index,
forecast_interval[1][:,0],
forecast_interval[2][:,0], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('CPI')
plt.legend()
plt.show()
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['countOfAllTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 3.78    MAE: 1.75
For 1# Signals and for Lag Order = 2    MSE: 3.21    MAE: 1.59
For 1# Signals and for Lag Order = 3    MSE: 2.17    MAE: 1.33
For 1# Signals and for Lag Order = 4    MSE: 1.05    MAE: 0.87
For 1# Signals and for Lag Order = 5    MSE: 8.13    MAE: 2.62
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['countOfInflationTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 3.09    MAE: 1.58
For 1# Signals and for Lag Order = 2    MSE: 5.37    MAE: 2.03
For 1# Signals and for Lag Order = 3    MSE: 0.51    MAE: 0.60
For 1# Signals and for Lag Order = 4    MSE: 41.22   MAE: 4.91
For 1# Signals and for Lag Order = 5    MSE: 41.95   MAE: 4.96
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOfCompoundSentimentForAllTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 5.69    MAE: 2.15
For 1# Signals and for Lag Order = 2    MSE: 5.96    MAE: 2.16
For 1# Signals and for Lag Order = 3    MSE: 0.88    MAE: 0.84
For 1# Signals and for Lag Order = 4    MSE: 4.16    MAE: 1.81
For 1# Signals and for Lag Order = 5    MSE: 3.66    MAE: 1.71
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOf+veSentimentForAllTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 5.36    MAE: 2.08
For 1# Signals and for Lag Order = 2    MSE: 5.61    MAE: 2.08
For 1# Signals and for Lag Order = 3    MSE: 0.81    MAE: 0.81
For 1# Signals and for Lag Order = 4    MSE: 4.30    MAE: 1.88
For 1# Signals and for Lag Order = 5    MSE: 7.58    MAE: 2.52
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOf+veSentimentForInflationTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 8.81    MAE: 2.64
For 1# Signals and for Lag Order = 2    MSE: 12.12   MAE: 2.99
For 1# Signals and for Lag Order = 3    MSE: 1.55    MAE: 1.10
For 1# Signals and for Lag Order = 4    MSE: 13.67   MAE: 2.76
For 1# Signals and for Lag Order = 5    MSE: 14.81   MAE: 2.79
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOf-veSentimentForInflationTweets']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 2.45    MAE: 1.41
For 1# Signals and for Lag Order = 2    MSE: 1.65    MAE: 1.13
For 1# Signals and for Lag Order = 3    MSE: 0.62    MAE: 0.64
For 1# Signals and for Lag Order = 4    MSE: 23.90   MAE: 3.90
For 1# Signals and for Lag Order = 5    MSE: 34.09   MAE: 4.59
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOfallTweetsTop15Unigrams']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 5.54    MAE: 2.11
For 1# Signals and for Lag Order = 2    MSE: 3.57    MAE: 1.69
For 1# Signals and for Lag Order = 3    MSE: 1.12    MAE: 0.79
For 1# Signals and for Lag Order = 4    MSE: 145.82  MAE: 8.92
For 1# Signals and for Lag Order = 5    MSE: 193.07  MAE: 10.22
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['meanOfinflationTweetsTop15Bigrams']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 2.60    MAE: 1.45
For 1# Signals and for Lag Order = 2    MSE: 1.61    MAE: 1.10
For 1# Signals and for Lag Order = 3    MSE: 3.56    MAE: 1.70
For 1# Signals and for Lag Order = 4    MSE: 0.78    MAE: 0.76
For 1# Signals and for Lag Order = 5    MSE: 0.83    MAE: 0.81
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df['sumOfinflationTweetsTop15Bigrams']], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 1# Signals and for Lag Order = 1    MSE: 2.86    MAE: 1.53
For 1# Signals and for Lag Order = 2    MSE: 1.14    MAE: 0.90
For 1# Signals and for Lag Order = 3    MSE: 0.40    MAE: 0.52
For 1# Signals and for Lag Order = 4    MSE: 26.39   MAE: 3.96
For 1# Signals and for Lag Order = 5    MSE: 36.47   MAE: 4.63
threshold = 0.90
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
2
temp
countOfInflationTweets                 0.934124
sumOf-veSentimentForInflationTweets   -0.921260
CPIH_annualRate                        1.000000
Name: CPIH_annualRate, dtype: float64
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[highly_correlated_signals]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 2# Signals and for Lag Order = 1    MSE: 4.24    MAE: 1.86
For 2# Signals and for Lag Order = 2    MSE: 10.51   MAE: 2.80
For 2# Signals and for Lag Order = 3    MSE: 1.61    MAE: 1.12
For 2# Signals and for Lag Order = 4    MSE: 20.44   MAE: 3.47
For 2# Signals and for Lag Order = 5    MSE: 19.00   MAE: 3.35
threshold = 0.895
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
3
temp
countOfInflationTweets                 0.934124
sumOf+veSentimentForInflationTweets    0.896403
sumOf-veSentimentForInflationTweets   -0.921260
CPIH_annualRate                        1.000000
Name: CPIH_annualRate, dtype: float64
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[highly_correlated_signals]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 3# Signals and for Lag Order = 1    MSE: 7.70    MAE: 2.45
For 3# Signals and for Lag Order = 2    MSE: 12.92   MAE: 3.01
For 3# Signals and for Lag Order = 3    MSE: 0.39    MAE: 0.51
For 3# Signals and for Lag Order = 4    MSE: 44.36   MAE: 5.16
For 3# Signals and for Lag Order = 5    MSE: 44.38   MAE: 5.13
threshold = 0.87
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
8
temp
countOfAllTweets                       0.893800
sumOf+veSentimentForAllTweets          0.890000
sumOf-veSentimentForAllTweets         -0.874672
countOfInflationTweets                 0.934124
sumOf+veSentimentForInflationTweets    0.896403
sumOf-veSentimentForInflationTweets   -0.921260
countOfProfessionalsTweets             0.886528
countOfProfessionalsInflationTweets    0.876029
CPIH_annualRate                        1.000000
Name: CPIH_annualRate, dtype: float64
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[highly_correlated_signals]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 8# Signals and for Lag Order = 1    MSE: 6.61    MAE: 2.22
For 8# Signals and for Lag Order = 2    MSE: 1.73    MAE: 1.13
For 8# Signals and for Lag Order = 3    MSE: 8.36    MAE: 2.38
For 8# Signals and for Lag Order = 4    MSE: 142.60  MAE: 10.07
For 8# Signals and for Lag Order = 5    MSE: 329.20  MAE: 12.32
C:\Users\AhmedOmar\anaconda3\envs\dde\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py:1242: RuntimeWarning: invalid value encountered in sqrt sigma = np.sqrt(self._forecast_vars(steps))
threshold = 0.85
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
14
temp
countOfAllTweets                           0.893800
sumOfCompoundSentimentForAllTweets         0.855161
sumOf+veSentimentForAllTweets              0.890000
sumOf-veSentimentForAllTweets             -0.874672
countOfInflationTweets                     0.934124
sumOf+veSentimentForInflationTweets        0.896403
sumOf-veSentimentForInflationTweets       -0.921260
countOfProfessionalsTweets                 0.886528
sumOf+veSentimentForProfessionalsTweets    0.864849
countOfProfessionalsInflationTweets        0.876029
sumOfallTweetsTop15Unigrams                0.868673
sumOfinflationTweetsTop15Unigrams          0.851980
sumOfinflationTweetsTop15Bigrams           0.853064
meanOfinflationTweetsTop15Bigrams          0.866670
CPIH_annualRate                            1.000000
Name: CPIH_annualRate, dtype: float64
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[highly_correlated_signals]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 14# Signals and for Lag Order = 1    MSE: 2.51     MAE: 1.43
For 14# Signals and for Lag Order = 2    MSE: 0.43     MAE: 0.56
For 14# Signals and for Lag Order = 3    MSE: 3226.05  MAE: 44.17
For 14# Signals and for Lag Order = 4    MSE: 547.75   MAE: 16.07
For 14# Signals and for Lag Order = 5    MSE: 21.00    MAE: 4.09
threshold = 0.80
temp = pd.concat([indicators_df, CPIH_annualRate], axis=1).corr()[CPIH_annualRate.columns[0]].apply(lambda x: x if abs(x) >= threshold else None)
temp = temp[~temp.isna()]
highly_correlated_signals = temp.index.to_list()[:-1]
print(len(highly_correlated_signals))
16
temp
countOfAllTweets                                 0.893800
sumOfCompoundSentimentForAllTweets               0.855161
sumOf+veSentimentForAllTweets                    0.890000
sumOf-veSentimentForAllTweets                   -0.874672
countOfInflationTweets                           0.934124
sumOf+veSentimentForInflationTweets              0.896403
sumOf-veSentimentForInflationTweets             -0.921260
countOfProfessionalsTweets                       0.886528
sumOf+veSentimentForProfessionalsTweets          0.864849
sumOf-veSentimentForProfessionalsTweets         -0.842794
countOfProfessionalsInflationTweets              0.876029
sumOfallTweetsTop15Unigrams                      0.868673
sumOfinflationTweetsTop15Unigrams                0.851980
sumOfinflationTweetsTop15Bigrams                 0.853064
sumOfinflationProfessionalTweetsTop15Unigrams    0.845270
meanOfinflationTweetsTop15Bigrams                0.866670
CPIH_annualRate                                  1.000000
Name: CPIH_annualRate, dtype: float64
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[highly_correlated_signals]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
testVARFitting(train_df, test_df)
For 16# Signals and for Lag Order = 1    MSE: 0.92    MAE: 0.89
For 16# Signals and for Lag Order = 2    MSE: 12.87   MAE: 3.13
For 16# Signals and for Lag Order = 3    MSE: 960.76  MAE: 22.56
For 16# Signals and for Lag Order = 4    MSE: 36.56   MAE: 4.61
For 16# Signals and for Lag Order = 5    MSE: 34.74   MAE: 5.52
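The repeated threshold cells above follow a single pattern and can be collapsed into one loop. A self-contained sketch on synthetic indicators (only the correlation filter is shown; the VAR fitting step is omitted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 60
target = pd.Series(np.cumsum(rng.normal(0.2, 1, n)), name='CPIH_annualRate')
# Six synthetic indicators with progressively more noise around the target.
indicators = pd.DataFrame({
    f'sig{i}': target * (1 if i % 2 else -1) + rng.normal(0, i + 1, n)
    for i in range(6)
})

counts = []
for threshold in [0.90, 0.895, 0.87, 0.85, 0.80]:
    corr = indicators.corrwith(target)
    selected = corr[corr.abs() >= threshold].index.tolist()
    counts.append(len(selected))
    print(f'threshold={threshold}: {len(selected)} signals selected')
```

Lowering the threshold can only add signals, so the counts are non-decreasing as the loop proceeds, mirroring the 2 → 3 → 8 → 14 → 16 progression above.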
manuallySelectedSignals = [
'countOfAllTweets',
'countOfInflationTweets',
'sumOfCompoundSentimentForAllTweets',
'sumOf+veSentimentForAllTweets',
'sumOf+veSentimentForInflationTweets',
'sumOf-veSentimentForInflationTweets',
'sumOfallTweetsTop15Unigrams',
'meanOfinflationTweetsTop15Bigrams',
'sumOfinflationTweetsTop15Bigrams'
]
import itertools
def all_combinations(lst):
all_combinations_list = []
for r in range(1, len(lst) + 1):
# Generate combinations of length 'r' from the list 'lst'
combinations = itertools.combinations(lst, r)
# Convert the iterator to a list and extend the all_combinations_list
all_combinations_list.extend(list(combinations))
return all_combinations_list
# Get all possible combinations of the list
result = all_combinations(manuallySelectedSignals)
result = list(map(list, result))
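As a sanity check, a list of 9 signals yields 2^9 − 1 = 511 non-empty subsets, which matches the size of the exhaustive search. The same enumeration in a compact form (signal names here are stand-ins):

```python
import itertools

signals = [f's{i}' for i in range(9)]  # stand-ins for the 9 manually selected signals
subsets = [list(c) for r in range(1, len(signals) + 1)
           for c in itertools.combinations(signals, r)]
print(len(subsets))  # 511
```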
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')
lowestMSEPerCombination = [np.inf]*9
lagOflowestMSEPerCombination = [np.inf]*9
combinationOflowestMSEPerCombination = [np.inf]*9
for manualCompination in tqdm(result):
indicators_and_infation = pd.concat([CPIH_annualRate, indicators_df[manualCompination]], axis=1)
train_df = indicators_and_infation[:'2021-12-01']
test_df = indicators_and_infation['2022-01-01':]
lowestMSE, lagOfLowestMSE = testVARFitting(train_df, test_df, returnNum=True)
if lowestMSE < lowestMSEPerCombination[len(manualCompination)-1]:
lowestMSEPerCombination[len(manualCompination)-1] = lowestMSE
lagOflowestMSEPerCombination[len(manualCompination)-1] = lagOfLowestMSE
combinationOflowestMSEPerCombination[len(manualCompination)-1] = manualCompination
# print('='*100)
lowestMSEPerCombination
[0.3998287526876619, 0.2785407851030476, 0.16019522420990934, 0.11426713894398344, 0.23610625880220723, 0.22603512059429418, 0.2421631444265442, 0.3235139420161793, 1.5968220794102965]
lagOflowestMSEPerCombination
[3, 3, 3, 4, 3, 3, 3, 3, 2]
combinationOflowestMSEPerCombination
[['sumOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'sumOfallTweetsTop15Unigrams'],
 ['countOfInflationTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOfinflationTweetsTop15Bigrams'],
 ['countOfInflationTweets', 'sumOf+veSentimentForAllTweets', 'sumOf-veSentimentForInflationTweets', 'meanOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'countOfInflationTweets', 'sumOf+veSentimentForInflationTweets', 'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'countOfInflationTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'countOfInflationTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'countOfInflationTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForAllTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams'],
 ['countOfAllTweets', 'countOfInflationTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForAllTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams', 'sumOfinflationTweetsTop15Bigrams']]
manualCompination
['sumOfinflationTweetsTop15Bigrams']
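The nine lists above are the best-performing subset of each size found by an exhaustive search over all non-empty combinations of the nine candidate signals. A minimal sketch of how that search space can be enumerated (the signal names are copied from the results above; the actual search loop appears earlier in the notebook):

```python
from itertools import combinations

# The nine candidate signal names, taken from the results above.
signals = [
    'countOfAllTweets', 'countOfInflationTweets',
    'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForAllTweets',
    'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets',
    'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams',
    'sumOfinflationTweetsTop15Bigrams',
]

# Every non-empty subset: 2**9 - 1 = 511 combinations to evaluate.
all_subsets = [list(c) for r in range(1, len(signals) + 1)
               for c in combinations(signals, r)]
len(all_subsets)  # -> 511
```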
for idx in range(len(combinationOflowestMSEPerCombination)):
specifyLag = lagOflowestMSEPerCombination[idx]
manualCompination = combinationOflowestMSEPerCombination[idx]
indicators_and_inflation = pd.concat([CPIH_annualRate, indicators_df[manualCompination]], axis=1)
train_df = indicators_and_inflation[:'2021-12-01']
test_df = indicators_and_inflation['2022-01-01':]
testVARFitting(train_df, test_df, specifyLag=specifyLag)
For 1# Signals and for Lag Order = 3 MSE: 0.40 MAE: 0.52
----------------------------------------------------------------------------------------------------
For 2# Signals and for Lag Order = 3 MSE: 0.28 MAE: 0.45
----------------------------------------------------------------------------------------------------
For 3# Signals and for Lag Order = 3 MSE: 0.16 MAE: 0.34
----------------------------------------------------------------------------------------------------
For 4# Signals and for Lag Order = 4 MSE: 0.11 MAE: 0.27
----------------------------------------------------------------------------------------------------
For 5# Signals and for Lag Order = 3 MSE: 0.24 MAE: 0.35
----------------------------------------------------------------------------------------------------
For 6# Signals and for Lag Order = 3 MSE: 0.23 MAE: 0.42
----------------------------------------------------------------------------------------------------
For 7# Signals and for Lag Order = 3 MSE: 0.24 MAE: 0.43
----------------------------------------------------------------------------------------------------
For 8# Signals and for Lag Order = 3 MSE: 0.32 MAE: 0.49
----------------------------------------------------------------------------------------------------
For 9# Signals and for Lag Order = 2 MSE: 1.60 MAE: 1.12
----------------------------------------------------------------------------------------------------
pd.concat([CPI_allItems, CPIH_annualRate, CPI_energy], axis=1).to_csv('inflationIndexes.csv')
df_DL = pd.concat([indicators_df, CPI_allItems, CPIH_annualRate, CPI_energy], axis=1)
target = ['CPI_allItems', 'CPIH_annualRate', 'CPI_energy']
shift_steps = 1  # forecast horizon: one month ahead
df_targets = df_DL[target].shift(-shift_steps)
df_DL[target].head(shift_steps + 5)
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
| 2018-06-01 | 105.9 | 2.30 | 111.2 |
df_DL[target].head()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
df_targets.tail()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2022-08-01 | 122.3 | 8.81 | 173.0 |
| 2022-09-01 | 124.3 | 9.59 | 197.8 |
| 2022-10-01 | 124.8 | 9.35 | 198.1 |
| 2022-11-01 | 125.3 | 9.24 | 194.4 |
| 2022-12-01 | NaN | NaN | NaN |
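The `shift(-shift_steps)` call above is what turns this into a one-step-ahead forecasting problem: each month's features are paired with the *following* month's index values, which leaves a NaN in the final target row (2022-12-01 above). A toy illustration using the first CPIH values from the table:

```python
import pandas as pd

# One-step-ahead target construction: shifting targets back by one step
# aligns each month's features with the next month's value.
toy = pd.DataFrame({'CPIH_annualRate': [2.71, 2.45, 2.29, 2.20]},
                   index=pd.date_range('2018-01-01', periods=4, freq='MS'))
toy_targets = toy.shift(-1)   # last row becomes NaN
# Both x and y therefore drop the final `shift_steps` rows.
x = toy.values[:-1]
y = toy_targets.values[:-1]   # y[0] is the Feb value, paired with Jan features
```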
# The input samples
x_data = df_DL.values[0:-shift_steps]
# The target-samples
y_data = df_targets.values[:-shift_steps]
# The number of samples in the data-set:
num_data = len(x_data)
num_data
59
# The fraction of the data set used for the training set
train_split = 0.8
# The number of samples in the training-set:
num_train = int(train_split * num_data)
num_train
47
# The number of samples in the test-set:
num_test = num_data - num_train
num_test
12
# Input samples for the training and test sets
x_train = x_data[0:num_train]
x_test = x_data[num_train:]
# Target samples for the training and test sets
y_train = y_data[0:num_train]
y_test = y_data[num_train:]
# Number of features
num_x_signals = x_data.shape[1]
num_x_signals
49
# This is the number of targets
num_y_signals = y_data.shape[1]
num_y_signals
3
from sklearn.preprocessing import MinMaxScaler
x_scaler = MinMaxScaler()
x_train_scaled = x_scaler.fit_transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
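Note that both scalers are fitted on the training window only and merely applied to the test window, so no statistics from the 2022 test period leak into training. Because 2022 inflation far exceeds the 2018-2021 range, the scaled test targets can legitimately fall well outside [0, 1], which is one reason the validation loss later dwarfs the training loss. A tiny illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on the training range only; out-of-range test values map outside [0, 1].
train = np.array([[2.0], [3.0], [4.0]])
test = np.array([[9.0]])            # an out-of-range value, like the 2022 spike
scaler = MinMaxScaler().fit(train)
scaler.transform(test)              # -> [[3.5]], well above 1
```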
print(x_train_scaled.shape)
print(y_train_scaled.shape)
(47, 49) (47, 3)
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
batch_size = 2
sequence_length = 24
# create the batch-generator
generator = batch_generator(batch_size=batch_size,
sequence_length=sequence_length)
# test the batch-generator to see if it works
x_batch, y_batch = next(generator)
print(x_batch.shape)
print(y_batch.shape)
(2, 24, 49) (2, 24, 3)
x_train_scaled.shape
(47, 49)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from tensorflow.keras.backend import square, mean
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
Num GPUs Available: 1 Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
<tensorflow.python.client.session.Session at 0x2527ee3f208>
validation_data = (np.expand_dims(x_test_scaled, axis=0),
np.expand_dims(y_test_scaled, axis=0))
def loss_mse_warmup(y_true, y_pred):
"""
Calculate the Mean Squared Error between y_true and y_pred,
but ignore the beginning "warmup" part of the sequences.
y_true is the desired output.
y_pred is the model's output.
"""
# Both input tensors have this shape:
# [batch_size, sequence_length, num_y_signals].
# Ignore the "warmup" parts of the sequences
# by taking slices of the tensors.
y_true_slice = y_true[:, warmup_steps:, :]
y_pred_slice = y_pred[:, warmup_steps:, :]
# These sliced tensors both have this shape:
# [batch_size, sequence_length - warmup_steps, num_y_signals]
# Calculate the Mean Squared Error and use it as the loss.
mse = mean(square(y_true_slice - y_pred_slice))
return mse
warmup_steps = 3
model = Sequential()
model.add(GRU(units=512,
return_sequences=True,
input_shape=(None, num_x_signals,)))
model.add(Dense(num_y_signals, activation='sigmoid'))
optimizer = RMSprop(learning_rate=1e-5)
model.compile(loss=loss_mse_warmup, optimizer=optimizer)
model.summary()
Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_6 (GRU) (None, None, 512) 864768
dense_6 (Dense) (None, None, 3) 1539
=================================================================
Total params: 866,307
Trainable params: 866,307
Non-trainable params: 0
_________________________________________________________________
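The parameter count can be sanity-checked by hand: a Keras GRU with the default `reset_after=True` uses two bias vectors per gate, giving `3 * units * (inputs + units + 2)` parameters, and the dense head adds `units + 1` per output signal.

```python
# Reproduce the counts reported by model.summary() above.
units, num_inputs = 512, 49
gru_params = 3 * units * (num_inputs + units + 2)   # -> 864768
dense_params = (units + 1) * 3                      # 512 weights + 1 bias per signal
total = gru_params + dense_params                   # -> 866307
```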
path_checkpoint = './checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
monitor='val_loss',
verbose=1,
save_weights_only=True,
save_best_only=True)
callback_early_stopping = EarlyStopping(monitor='val_loss',
patience=5, verbose=1)
callback_tensorboard = TensorBoard(log_dir='./logs/',
histogram_freq=0,
write_graph=False)
callback_reduce_lr = ReduceLROnPlateau(monitor='val_loss',
factor=0.1,
min_lr=1e-10,
patience=1,
verbose=1)
callbacks = [
callback_early_stopping,
callback_checkpoint,
callback_tensorboard,
callback_reduce_lr
]
%%time
model.fit(x=generator,
epochs=100,
steps_per_epoch=1,
validation_data=validation_data,
callbacks=callbacks)
Epoch 1/100
1/1 [==============================] - ETA: 0s - loss: 0.0567
Epoch 1: val_loss improved from inf to 3.52649, saving model to .\checkpoint.keras
1/1 [==============================] - 3s 3s/step - loss: 0.0567 - val_loss: 3.5265 - lr: 1.0000e-05
Epoch 2/100
1/1 [==============================] - ETA: 0s - loss: 0.0571
Epoch 2: val_loss did not improve from 3.52649
Epoch 2: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-07.
1/1 [==============================] - 0s 102ms/step - loss: 0.0571 - val_loss: 3.5431 - lr: 1.0000e-05
Epoch 3/100
1/1 [==============================] - ETA: 0s - loss: 0.0706
Epoch 3: val_loss did not improve from 3.52649
Epoch 3: ReduceLROnPlateau reducing learning rate to 9.999999974752428e-08.
1/1 [==============================] - 0s 102ms/step - loss: 0.0706 - val_loss: 3.5447 - lr: 1.0000e-06
Epoch 4/100
1/1 [==============================] - ETA: 0s - loss: 0.0886
Epoch 4: val_loss did not improve from 3.52649
Epoch 4: ReduceLROnPlateau reducing learning rate to 1.0000000116860975e-08.
1/1 [==============================] - 0s 103ms/step - loss: 0.0886 - val_loss: 3.5448 - lr: 1.0000e-07
Epoch 5/100
1/1 [==============================] - ETA: 0s - loss: 0.0747
Epoch 5: val_loss did not improve from 3.52649
Epoch 5: ReduceLROnPlateau reducing learning rate to 9.999999939225292e-10.
1/1 [==============================] - 0s 137ms/step - loss: 0.0747 - val_loss: 3.5448 - lr: 1.0000e-08
Epoch 6/100
1/1 [==============================] - ETA: 0s - loss: 0.0809
Epoch 6: val_loss did not improve from 3.52649
Epoch 6: ReduceLROnPlateau reducing learning rate to 1e-10.
1/1 [==============================] - 0s 108ms/step - loss: 0.0809 - val_loss: 3.5448 - lr: 1.0000e-09
Epoch 6: early stopping
Wall time: 3.32 s
<keras.callbacks.History at 0x251c43632c8>
result = model.evaluate(x=np.expand_dims(x_test_scaled, axis=0),
y=np.expand_dims(y_test_scaled, axis=0))
1/1 [==============================] - 0s 36ms/step - loss: 3.5448
def plot_comparison(start_idx, length=100, train=True):
"""
Plot the predicted and true output-signals.
:param start_idx: Start-index for the time-series.
:param length: Sequence-length to process and plot.
:param train: Boolean whether to use training- or test-set.
"""
if train:
# Use training-data.
x = x_train_scaled
y_true = y_train
else:
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
plot_comparison(start_idx=0, length=18, train=True)
1/1 [==============================] - 0s 267ms/step
plot_comparison(start_idx=0, length=12, train=False)
1/1 [==============================] - 0s 299ms/step
start_idx = 0
length = 12
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
if target[signal] != 'CPIH_annualRate': continue
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
1/1 [==============================] - 0s 35ms/step
np.mean((signal_true - signal_pred)**2)
34.19055362337526
df_DL = pd.concat([indicators_df, CPI_allItems, CPIH_annualRate_withLags, CPI_energy], axis=1)
target = ['CPI_allItems', 'CPIH_annualRate', 'CPI_energy']
shift_steps = 1  # forecast horizon: one month ahead
df_targets = df_DL[target].shift(-shift_steps)
df_DL[target].head(shift_steps + 5)
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
| 2018-06-01 | 105.9 | 2.30 | 111.2 |
df_DL[target].head()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
df_targets.tail()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2022-08-01 | 122.3 | 8.81 | 173.0 |
| 2022-09-01 | 124.3 | 9.59 | 197.8 |
| 2022-10-01 | 124.8 | 9.35 | 198.1 |
| 2022-11-01 | 125.3 | 9.24 | 194.4 |
| 2022-12-01 | NaN | NaN | NaN |
# The input samples
x_data = df_DL.values[0:-shift_steps]
# The target-samples
y_data = df_targets.values[:-shift_steps]
# The number of samples in the data-set:
num_data = len(x_data)
num_data
59
# The fraction of the data set used for the training set
train_split = 0.8
# The number of samples in the training-set:
num_train = int(train_split * num_data)
num_train
47
# The number of samples in the test-set:
num_test = num_data - num_train
num_test
12
# Input samples for the training and test sets
x_train = x_data[0:num_train]
x_test = x_data[num_train:]
# Target samples for the training and test sets
y_train = y_data[0:num_train]
y_test = y_data[num_train:]
# Number of features
num_x_signals = x_data.shape[1]
num_x_signals
55
# This is the number of targets
num_y_signals = y_data.shape[1]
num_y_signals
3
from sklearn.preprocessing import MinMaxScaler
x_scaler = MinMaxScaler()
x_train_scaled = x_scaler.fit_transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
print(x_train_scaled.shape)
print(y_train_scaled.shape)
(47, 55) (47, 3)
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
batch_size = 36
sequence_length = 36
# create the batch-generator
generator = batch_generator(batch_size=batch_size,
sequence_length=sequence_length)
# test the batch-generator to see if it works
x_batch, y_batch = next(generator)
print(x_batch.shape)
print(y_batch.shape)
(36, 36, 55) (36, 36, 3)
x_train_scaled.shape
(47, 55)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from tensorflow.keras.backend import square, mean
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
Num GPUs Available: 1 Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
<tensorflow.python.client.session.Session at 0x252a15d7a88>
validation_data = (np.expand_dims(x_test_scaled, axis=0),
np.expand_dims(y_test_scaled, axis=0))
def loss_mse_warmup(y_true, y_pred):
"""
Calculate the Mean Squared Error between y_true and y_pred,
but ignore the beginning "warmup" part of the sequences.
y_true is the desired output.
y_pred is the model's output.
"""
# Both input tensors have this shape:
# [batch_size, sequence_length, num_y_signals].
# Ignore the "warmup" parts of the sequences
# by taking slices of the tensors.
y_true_slice = y_true[:, warmup_steps:, :]
y_pred_slice = y_pred[:, warmup_steps:, :]
# These sliced tensors both have this shape:
# [batch_size, sequence_length - warmup_steps, num_y_signals]
# Calculate the Mean Squared Error and use it as the loss.
mse = mean(square(y_true_slice - y_pred_slice))
return mse
warmup_steps = 3
model = Sequential()
model.add(GRU(units=512,
return_sequences=True,
input_shape=(None, num_x_signals,)))
model.add(Dense(num_y_signals, activation='sigmoid'))
optimizer = RMSprop(learning_rate=1e-7)
model.compile(loss=loss_mse_warmup, optimizer=optimizer)
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_5 (GRU) (None, None, 512) 873984
dense_5 (Dense) (None, None, 3) 1539
=================================================================
Total params: 875,523
Trainable params: 875,523
Non-trainable params: 0
_________________________________________________________________
path_checkpoint = './checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
monitor='val_loss',
verbose=1,
save_weights_only=True,
save_best_only=True)
callback_early_stopping = EarlyStopping(monitor='val_loss',
patience=5, verbose=1)
callback_tensorboard = TensorBoard(log_dir='./logs/',
histogram_freq=0,
write_graph=False)
callback_reduce_lr = ReduceLROnPlateau(monitor='val_loss',
factor=0.1,
min_lr=1e-10,
patience=1,
verbose=1)
callbacks = [
callback_early_stopping,
callback_checkpoint,
callback_tensorboard,
callback_reduce_lr
]
%%time
model.fit(x=generator,
epochs=100,
steps_per_epoch=1,
validation_data=validation_data,
callbacks=callbacks)
Epoch 1/100
1/1 [==============================] - ETA: 0s - loss: 0.0682
Epoch 1: val_loss improved from inf to 3.29529, saving model to .\checkpoint.keras
1/1 [==============================] - 2s 2s/step - loss: 0.0682 - val_loss: 3.2953 - lr: 1.0000e-07
Epoch 2/100
1/1 [==============================] - ETA: 0s - loss: 0.0682
Epoch 2: val_loss did not improve from 3.29529
Epoch 2: ReduceLROnPlateau reducing learning rate to 1.0000000116860975e-08.
1/1 [==============================] - 0s 134ms/step - loss: 0.0682 - val_loss: 3.2955 - lr: 1.0000e-07
Epoch 3/100
1/1 [==============================] - ETA: 0s - loss: 0.0682
Epoch 3: val_loss did not improve from 3.29529
Epoch 3: ReduceLROnPlateau reducing learning rate to 9.999999939225292e-10.
1/1 [==============================] - 0s 137ms/step - loss: 0.0682 - val_loss: 3.2955 - lr: 1.0000e-08
Epoch 4/100
1/1 [==============================] - ETA: 0s - loss: 0.0678
Epoch 4: val_loss did not improve from 3.29529
Epoch 4: ReduceLROnPlateau reducing learning rate to 1e-10.
1/1 [==============================] - 0s 139ms/step - loss: 0.0678 - val_loss: 3.2955 - lr: 1.0000e-09
Epoch 5/100
1/1 [==============================] - ETA: 0s - loss: 0.0678
Epoch 5: val_loss did not improve from 3.29529
1/1 [==============================] - 0s 132ms/step - loss: 0.0678 - val_loss: 3.2955 - lr: 1.0000e-10
Epoch 6/100
1/1 [==============================] - ETA: 0s - loss: 0.0677
Epoch 6: val_loss did not improve from 3.29529
1/1 [==============================] - 0s 159ms/step - loss: 0.0677 - val_loss: 3.2955 - lr: 1.0000e-10
Epoch 6: early stopping
Wall time: 3 s
<keras.callbacks.History at 0x252b172ea08>
result = model.evaluate(x=np.expand_dims(x_test_scaled, axis=0),
y=np.expand_dims(y_test_scaled, axis=0))
1/1 [==============================] - 0s 35ms/step - loss: 3.2955
def plot_comparison(start_idx, length=100, train=True):
"""
Plot the predicted and true output-signals.
:param start_idx: Start-index for the time-series.
:param length: Sequence-length to process and plot.
:param train: Boolean whether to use training- or test-set.
"""
if train:
# Use training-data.
x = x_train_scaled
y_true = y_train
else:
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
plot_comparison(start_idx=0, length=60, train=True)
WARNING:tensorflow:5 out of the last 6 calls to <function Model.make_predict_function.<locals>.predict_function at 0x00000252A14DE4C8> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
1/1 [==============================] - 0s 288ms/step
plot_comparison(start_idx=0, length=12, train=False)
1/1 [==============================] - 0s 358ms/step
start_idx = 0
length = 12
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
if target[signal] != 'CPIH_annualRate': continue
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
1/1 [==============================] - 0s 34ms/step
np.mean((signal_true - signal_pred)**2)
22.904962383607426
pd.read_pickle('./month_agg_bert_inflation.pkl').shape
(60, 768)
month_agg_bert_inflation = pd.DataFrame(pd.read_pickle('./month_agg_bert_inflation.pkl'), index=indicators_df.index.to_list())
month_agg_bert_professionals_inflation = pd.DataFrame(pd.read_pickle('./month_agg_bert_professionals_inflation.pkl'), index=indicators_df.index.to_list())
month_agg_bert_professionals = pd.DataFrame(pd.read_pickle('./month_agg_bert_professionals.pkl'), index=indicators_df.index.to_list())
month_agg_bert = pd.DataFrame(pd.read_pickle('./month_agg_bert.pkl'), index=indicators_df.index.to_list())
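Each of the four pickles holds a (60, 768) matrix: one 768-dimensional BERT embedding per month for each tweet subset, which together with the three index series account for the 3,075 input signals below (4 × 768 + 3). How the pickles were built is not shown in this notebook; a plausible sketch is mean-pooling per-tweet embeddings within each calendar month (the embedding model itself, e.g. a sentence-level BERT encoder, is an assumption and is stubbed out with random vectors here):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly aggregation of per-tweet embeddings: one 768-dim
# row per tweet, indexed by tweet date, mean-pooled to month starts.
rng = np.random.default_rng(0)
dates = pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-10'])
tweet_embeddings = pd.DataFrame(rng.normal(size=(3, 768)), index=dates)
monthly = tweet_embeddings.resample('MS').mean()  # one 768-dim row per month
monthly.shape  # -> (2, 768)
```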
pd.concat([CPI_allItems, CPIH_annualRate, CPI_energy], axis=1).to_csv('inflationIndexes.csv')
df_DL = pd.concat([CPI_allItems,
month_agg_bert_inflation, month_agg_bert_professionals_inflation,
month_agg_bert_professionals, month_agg_bert,
CPIH_annualRate, CPI_energy], axis=1)
target = ['CPI_allItems', 'CPIH_annualRate', 'CPI_energy']
shift_steps = 1  # forecast horizon: one month ahead
df_targets = df_DL[target].shift(-shift_steps)
df_DL[target].head(shift_steps + 5)
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
| 2018-06-01 | 105.9 | 2.30 | 111.2 |
df_DL[target].head()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
df_targets.tail()
|  | CPI_allItems | CPIH_annualRate | CPI_energy |
|---|---|---|---|
| 2022-08-01 | 122.3 | 8.81 | 173.0 |
| 2022-09-01 | 124.3 | 9.59 | 197.8 |
| 2022-10-01 | 124.8 | 9.35 | 198.1 |
| 2022-11-01 | 125.3 | 9.24 | 194.4 |
| 2022-12-01 | NaN | NaN | NaN |
# The input samples
x_data = df_DL.values[0:-shift_steps]
# The target-samples
y_data = df_targets.values[:-shift_steps]
# The number of samples in the data-set:
num_data = len(x_data)
num_data
59
# The fraction of the data set used for the training set
train_split = 0.8
# The number of samples in the training-set:
num_train = int(train_split * num_data)
num_train
47
# The number of samples in the test-set:
num_test = num_data - num_train
num_test
12
# Input samples for the training and test sets
x_train = x_data[0:num_train]
x_test = x_data[num_train:]
# Target samples for the training and test sets
y_train = y_data[0:num_train]
y_test = y_data[num_train:]
# Number of features
num_x_signals = x_data.shape[1]
num_x_signals
3075
# This is the number of targets
num_y_signals = y_data.shape[1]
num_y_signals
3
from sklearn.preprocessing import MinMaxScaler
x_scaler = MinMaxScaler()
x_train_scaled = x_scaler.fit_transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
print(x_train_scaled.shape)
print(y_train_scaled.shape)
(47, 3075) (47, 3)
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
batch_size = 6
sequence_length = 12
# create the batch-generator
generator = batch_generator(batch_size=batch_size,
sequence_length=sequence_length)
# test the batch-generator to see if it works
x_batch, y_batch = next(generator)
print(x_batch.shape)
print(y_batch.shape)
(6, 12, 3075) (6, 12, 3)
x_train_scaled.shape
(47, 3075)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from tensorflow.keras.backend import square, mean
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
Num GPUs Available: 1 Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
<tensorflow.python.client.session.Session at 0x252d4dc0548>
validation_data = (np.expand_dims(x_test_scaled, axis=0),
np.expand_dims(y_test_scaled, axis=0))
def loss_mse_warmup(y_true, y_pred):
"""
Calculate the Mean Squared Error between y_true and y_pred,
but ignore the beginning "warmup" part of the sequences.
y_true is the desired output.
y_pred is the model's output.
"""
# Both input tensors have this shape:
# [batch_size, sequence_length, num_y_signals].
# Ignore the "warmup" parts of the sequences
# by taking slices of the tensors.
y_true_slice = y_true[:, warmup_steps:, :]
y_pred_slice = y_pred[:, warmup_steps:, :]
# These sliced tensors both have this shape:
# [batch_size, sequence_length - warmup_steps, num_y_signals]
# Calculate the Mean Squared Error and use it as loss.
mse = mean(square(y_true_slice - y_pred_slice))
return mse
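The warmup slicing is easy to sanity-check by reproducing the loss in plain NumPy on toy tensors (the shapes and values here are made up for illustration):

```python
import numpy as np

def mse_warmup_numpy(y_true, y_pred, warmup_steps):
    # Drop the first `warmup_steps` time-steps before averaging,
    # mirroring the slicing in loss_mse_warmup.
    diff = y_true[:, warmup_steps:, :] - y_pred[:, warmup_steps:, :]
    return np.mean(diff ** 2)

# Toy batch: 1 sequence, 4 steps, 1 signal. The first two steps are
# badly predicted, but a warmup of 2 ignores them entirely.
y_true = np.zeros((1, 4, 1))
y_pred = np.concatenate([np.full((1, 2, 1), 5.0),
                         np.ones((1, 2, 1))], axis=1)
print(mse_warmup_numpy(y_true, y_pred, warmup_steps=2))  # 1.0
```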
warmup_steps = 6
model = Sequential()
model.add(GRU(units=512,
return_sequences=True,
input_shape=(None, num_x_signals,)))
model.add(Dense(num_y_signals, activation='sigmoid'))
optimizer = RMSprop(learning_rate=1e-3)
model.compile(loss=loss_mse_warmup, optimizer=optimizer)
model.summary()
Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_7 (GRU) (None, None, 512) 5512704
dense_7 (Dense) (None, None, 3) 1539
=================================================================
Total params: 5,514,243
Trainable params: 5,514,243
Non-trainable params: 0
_________________________________________________________________
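The GRU parameter count in the summary can be reproduced by hand. Keras's GRU (with the default `reset_after=True`) has three gates, each with an input kernel, a recurrent kernel and two bias vectors, so for 512 units over the 3,075 input signals:

```python
units, n_inputs, n_outputs = 512, 3075, 3

# 3 gates x (input kernel + recurrent kernel + 2 bias vectors)
gru_params = 3 * (units * n_inputs + units * units + 2 * units)
# Dense head: 512 -> 3 outputs, plus biases.
dense_params = units * n_outputs + n_outputs

print(gru_params, dense_params, gru_params + dense_params)
# 5512704 1539 5514243, matching model.summary()
```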
path_checkpoint = './checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
monitor='val_loss',
verbose=1,
save_weights_only=True,
save_best_only=True)
callback_early_stopping = EarlyStopping(monitor='val_loss',
patience=6, verbose=1)
callback_tensorboard = TensorBoard(log_dir='./logs/',
histogram_freq=0,
write_graph=False)
callback_reduce_lr = ReduceLROnPlateau(monitor='val_loss',
factor=0.1,
min_lr=1e-5,
patience=1,
verbose=1)
callbacks = [
callback_early_stopping,
callback_checkpoint,
callback_tensorboard,
callback_reduce_lr
]
%%time
model.fit(x=generator,
epochs=100,
steps_per_epoch=1,
validation_data=validation_data,
callbacks=callbacks)
Epoch 1/100 - 3s 3s/step - loss: 0.1348 - val_loss: 6.0709 - lr: 0.0010 (val_loss improved from inf to 6.07090, saving model to .\checkpoint.keras)
Epoch 2/100 - 294ms/step - loss: 0.0994 - val_loss: 5.9587 - lr: 0.0010 (val_loss improved from 6.07090 to 5.95867)
Epoch 3/100 - 312ms/step - loss: 0.1578 - val_loss: 5.2222 - lr: 0.0010 (val_loss improved from 5.95867 to 5.22216)
Epoch 4/100 - 210ms/step - loss: 0.1567 - val_loss: 5.5323 - lr: 0.0010 (no improvement; ReduceLROnPlateau reducing learning rate to 1.0000e-04)
Epoch 5/100 - 204ms/step - loss: 0.0349 - val_loss: 5.5397 - lr: 1.0000e-04 (no improvement; ReduceLROnPlateau reducing learning rate to 1.0000e-05)
Epoch 6/100 - 204ms/step - loss: 0.0525 - val_loss: 5.5422 - lr: 1.0000e-05 (no improvement; ReduceLROnPlateau reducing learning rate to 1e-05)
Epoch 7/100 - 196ms/step - loss: 0.0977 - val_loss: 5.5447 - lr: 1.0000e-05 (no improvement)
Epoch 8/100 - 198ms/step - loss: 0.1281 - val_loss: 5.5462 - lr: 1.0000e-05 (no improvement)
Epoch 9/100 - 201ms/step - loss: 0.0998 - val_loss: 5.5472 - lr: 1.0000e-05 (no improvement)
Epoch 9: early stopping
Wall time: 5.38 s
<keras.callbacks.History at 0x252d547b708>
result = model.evaluate(x=np.expand_dims(x_test_scaled, axis=0),
y=np.expand_dims(y_test_scaled, axis=0))
1/1 [==============================] - 0s 63ms/step - loss: 5.5472
def plot_comparison(start_idx, length=100, train=True):
"""
Plot the predicted and true output-signals.
:param start_idx: Start-index for the time-series.
:param length: Sequence-length to process and plot.
:param train: Boolean whether to use training- or test-set.
"""
if train:
# Use training-data.
x = x_train_scaled
y_true = y_train
else:
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
plot_comparison(start_idx=0, length=60, train=True)
1/1 [==============================] - 0s 333ms/step
plot_comparison(start_idx=0, length=12, train=False)
1/1 [==============================] - 0s 294ms/step
start_idx = 0
length = 12
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
if target[signal] != 'CPIH_annualRate': continue
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
1/1 [==============================] - 0s 55ms/step
np.mean((signal_true - signal_pred)**2)
56.81625605127967
pd.read_pickle('./month_agg_bert_inflation.pkl').shape
(60, 768)
month_agg_bert_inflation = pd.DataFrame(pd.read_pickle('./month_agg_bert_inflation.pkl'), index=indicators_df.index.to_list())
month_agg_bert_professionals_inflation = pd.DataFrame(pd.read_pickle('./month_agg_bert_professionals_inflation.pkl'), index=indicators_df.index.to_list())
month_agg_bert_professionals = pd.DataFrame(pd.read_pickle('./month_agg_bert_professionals.pkl'), index=indicators_df.index.to_list())
month_agg_bert = pd.DataFrame(pd.read_pickle('./month_agg_bert.pkl'), index=indicators_df.index.to_list())
pd.concat([CPI_allItems, CPIH_annualRate, CPI_energy], axis=1).to_csv('inflationIndexes.csv')
df_DL = pd.concat([CPI_allItems,
month_agg_bert_inflation, month_agg_bert_professionals_inflation,
month_agg_bert_professionals, month_agg_bert,
CPIH_annualRate_withLags, CPI_energy], axis=1)
target = ['CPI_allItems', 'CPIH_annualRate', 'CPI_energy']
# Shift the targets forward by one month so the model predicts next month's values.
shift_steps = 1
df_targets = df_DL[target].shift(-shift_steps)
df_DL[target].head(shift_steps + 5)
| CPI_allItems | CPIH_annualRate | CPI_energy | |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
| 2018-06-01 | 105.9 | 2.30 | 111.2 |
df_DL[target].head()
| CPI_allItems | CPIH_annualRate | CPI_energy | |
|---|---|---|---|
| 2018-01-01 | 104.5 | 2.71 | 106.6 |
| 2018-02-01 | 104.9 | 2.45 | 106.5 |
| 2018-03-01 | 105.1 | 2.29 | 105.9 |
| 2018-04-01 | 105.5 | 2.20 | 106.9 |
| 2018-05-01 | 105.9 | 2.30 | 108.9 |
df_targets.tail()
| CPI_allItems | CPIH_annualRate | CPI_energy | |
|---|---|---|---|
| 2022-08-01 | 122.3 | 8.81 | 173.0 |
| 2022-09-01 | 124.3 | 9.59 | 197.8 |
| 2022-10-01 | 124.8 | 9.35 | 198.1 |
| 2022-11-01 | 125.3 | 9.24 | 194.4 |
| 2022-12-01 | NaN | NaN | NaN |
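The `shift(-shift_steps)` target construction is easy to get backwards, so a toy series helps: each row's target is the *next* month's value, and the final row becomes NaN, which is why the last `shift_steps` rows are dropped below. The values here are made up.

```python
import pandas as pd

s = pd.Series([104.5, 104.9, 105.1],
              index=pd.date_range('2018-01-01', periods=3, freq='MS'),
              name='CPI')
# Next month's CPI aligned to this month's features.
target = s.shift(-1)
print(target.tolist())  # [104.9, 105.1, nan]
```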
# The input-samples
x_data = df_DL.values[0:-shift_steps]
# The target-samples
y_data = df_targets.values[:-shift_steps]
# The number of samples in the data-set:
num_data = len(x_data)
num_data
59
# The fraction of the data-set that will be used for the training-set
train_split = 0.8
# The number of samples in the training-set:
num_train = int(train_split * num_data)
num_train
47
# The number of samples in the test-set:
num_test = num_data - num_train
num_test
12
# The input-samples for the training and test-sets
x_train = x_data[0:num_train]
x_test = x_data[num_train:]
# The target-samples for the training and test-sets
y_train = y_data[0:num_train]
y_test = y_data[num_train:]
# Number of features
num_x_signals = x_data.shape[1]
num_x_signals
3081
# This is the number of targets
num_y_signals = y_data.shape[1]
num_y_signals
3
from sklearn.preprocessing import MinMaxScaler
x_scaler = MinMaxScaler()
x_train_scaled = x_scaler.fit_transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
print(x_train_scaled.shape)
print(y_train_scaled.shape)
(47, 3081) (47, 3)
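Note that both scalers are fitted on the training split only, so the 2022 test months, where inflation exceeds anything seen in training, map outside [0, 1] after `transform`. A minimal illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[5.0]])              # beyond the training range

scaler.fit(train)                     # learns min=1, max=3 from train only
scaled_test = scaler.transform(test)  # (5 - 1) / (3 - 1)
print(scaled_test)                    # [[2.]] -- outside [0, 1]
```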
def batch_generator(batch_size, sequence_length):
"""
Generator function for creating random batches of training-data.
"""
# Infinite loop.
while True:
# Allocate a new array for the batch of input-signals.
x_shape = (batch_size, sequence_length, num_x_signals)
x_batch = np.zeros(shape=x_shape, dtype=np.float16)
# Allocate a new array for the batch of output-signals.
y_shape = (batch_size, sequence_length, num_y_signals)
y_batch = np.zeros(shape=y_shape, dtype=np.float16)
# Fill the batch with random sequences of data.
for i in range(batch_size):
# Get a random start-index.
# This points somewhere into the training-data.
idx = np.random.randint(num_train - sequence_length)
# Copy the sequences of data starting at this index.
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
yield (x_batch, y_batch)
batch_size = 36
sequence_length = 36
# create the batch-generator
generator = batch_generator(batch_size=batch_size,
sequence_length=sequence_length)
# test the batch-generator to see if it works
x_batch, y_batch = next(generator)
print(x_batch.shape)
print(y_batch.shape)
(36, 36, 3081) (36, 36, 3)
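With `num_train = 47` and `sequence_length = 36`, `np.random.randint(47 - 36)` draws a start index in [0, 11), so the 36 sequences in a batch overlap heavily — worth keeping in mind with only 59 monthly observations. A toy version of the windowing logic:

```python
import numpy as np

num_train, sequence_length = 47, 36
rng = np.random.default_rng(0)
data = np.arange(num_train)        # stand-in for the rows of x_train_scaled

idx = rng.integers(num_train - sequence_length)  # start index in [0, 11)
window = data[idx:idx + sequence_length]
print(idx, window.shape)           # every window has exactly 36 steps
```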
x_train_scaled.shape
(47, 3081)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from tensorflow.keras.backend import square, mean
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
Num GPUs Available: 1 Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
<tensorflow.python.client.session.Session at 0x252dfadd388>
validation_data = (np.expand_dims(x_test_scaled, axis=0),
np.expand_dims(y_test_scaled, axis=0))
def loss_mse_warmup(y_true, y_pred):
"""
Calculate the Mean Squared Error between y_true and y_pred,
but ignore the beginning "warmup" part of the sequences.
y_true is the desired output.
y_pred is the model's output.
"""
# Both input tensors have the shape:
# [batch_size, sequence_length, num_y_signals].
# Ignore the "warmup" parts of the sequences
# by taking slices of the tensors.
y_true_slice = y_true[:, warmup_steps:, :]
y_pred_slice = y_pred[:, warmup_steps:, :]
# These sliced tensors both have this shape:
# [batch_size, sequence_length - warmup_steps, num_y_signals]
# Calculate the Mean Squared Error and use it as loss.
mse = mean(square(y_true_slice - y_pred_slice))
return mse
warmup_steps = 6
model = Sequential()
model.add(GRU(units=512,
return_sequences=True,
input_shape=(None, num_x_signals,)))
model.add(Dense(num_y_signals, activation='sigmoid'))
optimizer = RMSprop(learning_rate=1e-5)
model.compile(loss=loss_mse_warmup, optimizer=optimizer)
model.summary()
Model: "sequential_9"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_9 (GRU) (None, None, 512) 5521920
dense_9 (Dense) (None, None, 3) 1539
=================================================================
Total params: 5,523,459
Trainable params: 5,523,459
Non-trainable params: 0
_________________________________________________________________
path_checkpoint = './checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
monitor='val_loss',
verbose=1,
save_weights_only=True,
save_best_only=True)
callback_early_stopping = EarlyStopping(monitor='val_loss',
patience=6, verbose=1)
callback_tensorboard = TensorBoard(log_dir='./logs/',
histogram_freq=0,
write_graph=False)
callback_reduce_lr = ReduceLROnPlateau(monitor='val_loss',
factor=0.1,
min_lr=1e-5,
patience=1,
verbose=1)
callbacks = [
callback_early_stopping,
callback_checkpoint,
callback_tensorboard,
callback_reduce_lr
]
%%time
model.fit(x=generator,
epochs=100,
steps_per_epoch=1,
validation_data=validation_data,
callbacks=callbacks)
Epoch 1/100 - 2s 2s/step - loss: 0.0634 - val_loss: 4.3472 - lr: 1.0000e-05 (val_loss improved from inf to 4.34723, saving model to .\checkpoint.keras)
Epoch 2/100 - 349ms/step - loss: 0.0286 - val_loss: 4.4868 - lr: 1.0000e-05 (no improvement)
Epoch 3/100 - 396ms/step - loss: 0.0305 - val_loss: 4.4371 - lr: 1.0000e-05 (no improvement)
Epoch 4/100 - 345ms/step - loss: 0.0271 - val_loss: 4.5092 - lr: 1.0000e-05 (no improvement)
Epoch 5/100 - 352ms/step - loss: 0.0268 - val_loss: 4.4860 - lr: 1.0000e-05 (no improvement)
Epoch 6/100 - 348ms/step - loss: 0.0259 - val_loss: 4.4882 - lr: 1.0000e-05 (no improvement)
Epoch 7/100 - 347ms/step - loss: 0.0230 - val_loss: 4.4927 - lr: 1.0000e-05 (no improvement)
Epoch 7: early stopping
Wall time: 4.37 s
<keras.callbacks.History at 0x252df35a8c8>
result = model.evaluate(x=np.expand_dims(x_test_scaled, axis=0),
y=np.expand_dims(y_test_scaled, axis=0))
1/1 [==============================] - 0s 61ms/step - loss: 4.4927
def plot_comparison(start_idx, length=100, train=True):
"""
Plot the predicted and true output-signals.
:param start_idx: Start-index for the time-series.
:param length: Sequence-length to process and plot.
:param train: Boolean whether to use training- or test-set.
"""
if train:
# Use training-data.
x = x_train_scaled
y_true = y_train
else:
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
plot_comparison(start_idx=0, length=60, train=True)
1/1 [==============================] - 0s 342ms/step
plot_comparison(start_idx=0, length=12, train=False)
1/1 [==============================] - 0s 327ms/step
start_idx = 0
length = 12
# Use test-data.
x = x_test_scaled
y_true = y_test
# End-index for the sequences.
end_idx = start_idx + length
# Select the sequences from the given start-index and
# of the given length.
x = x[start_idx:end_idx]
y_true = y_true[start_idx:end_idx]
# Input-signals for the model.
x = np.expand_dims(x, axis=0)
# Use the model to predict the output-signals.
y_pred = model.predict(x)
# The output of the model is between 0 and 1.
# Do an inverse map to get it back to the scale
# of the original data-set.
y_pred_rescaled = y_scaler.inverse_transform(y_pred[0])
# For each output-signal.
for signal in range(len(target)):
if target[signal] != 'CPIH_annualRate': continue
# Get the output-signal predicted by the model.
signal_pred = y_pred_rescaled[:, signal]
# Get the true output-signal from the data-set.
signal_true = y_true[:, signal]
# Make the plotting-canvas bigger.
plt.figure(figsize=(15,7))
# Plot and compare the two signals.
plt.plot(signal_true, label='true')
plt.plot(signal_pred, label='pred')
# Plot grey box for warmup-period.
p = plt.axvspan(0, warmup_steps, facecolor='black', alpha=0.15)
# Plot labels etc.
plt.ylabel(target[signal])
plt.legend()
plt.show()
1/1 [==============================] - 0s 62ms/step
np.mean((signal_true - signal_pred)**2)
39.6647107922409
X_train, X_test, y_train, y_test = indicators_train.values, indicators_test.values, inflation_train['CPIH_annualRate'].values, inflation_test['CPIH_annualRate'].values
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
results
{'MSE_train': 0.06852126041666688, 'MSE_test': 18.130624646666682}
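`mean_squared_error` is just the average squared residual, so the test figure above can be reproduced by hand — or converted to an RMSE, which is in the same percentage-point units as the CPIH rate and easier to interpret. A small worked example with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2.0, 4.0, 9.0])
y_pred = np.array([2.5, 3.0, 7.0])

# (0.25 + 1.0 + 4.0) / 3
mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))  # 1.75 and the RMSE in CPIH units
```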
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
CPIH_annualRate.plot(label='Ground Truth')
pd.Series(predictions_train_test, index=CPIH_annualRate.index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(indicators_test.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
highlyCorrelatedFeatures = ['sumOf-veSentimentForInflationTweets', 'meanOfCompoundSentimentForInflationTweets', 'meanOfallTweetsTop15Trigrams', 'sumOf-veSentimentForAllTweets',
'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams',
'countOfProfessionalsInflationTweets',
'sumOfinflationProfessionalTweetsTop15Unigrams',
'sumOfinflationTweetsTop15Unigrams', 'countOfProfessionalsTweets',
'countOfInflationTweets', 'sumOfinflationTweetsTop15Bigrams'
]
indicators_train_2 = indicators_train[highlyCorrelatedFeatures]
indicators_test_2 = indicators_test[highlyCorrelatedFeatures]
indicators_df[highlyCorrelatedFeatures].plot(figsize=(15,7));
X_train, X_test, y_train, y_test = indicators_train_2.values, indicators_test_2.values, inflation_train['CPIH_annualRate'].values, inflation_test['CPIH_annualRate'].values
learner = RandomForestRegressor(random_state=42)
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
results
{'MSE_train': 0.05474924625000018, 'MSE_test': 16.330649133333367}
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
CPIH_annualRate.plot(label='Ground Truth')
pd.Series(predictions_train_test, index=CPIH_annualRate.index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(indicators_test.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
CPIH_annualRate_withLags = pd.read_csv('GBRCPALTT01CTGYM.csv')
CPIH_annualRate_withLags = CPIH_annualRate_withLags.rename(columns={'GBRCPALTT01CTGYM': 'CPIH_annualRate'})
CPIH_annualRate_withLags['DATE'] = pd.to_datetime(CPIH_annualRate_withLags['DATE'])
CPIH_annualRate_withLags = CPIH_annualRate_withLags.set_index('DATE')
CPIH_annualRate_withLags['CPIH_annualRate_lag1'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(1)
CPIH_annualRate_withLags['CPIH_annualRate_lag2'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(2)
CPIH_annualRate_withLags['CPIH_annualRate_lag3'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(3)
CPIH_annualRate_withLags['CPIH_annualRate_lag4'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(4)
CPIH_annualRate_withLags['CPIH_annualRate_lag5'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(5)
CPIH_annualRate_withLags['CPIH_annualRate_lag6'] = CPIH_annualRate_withLags['CPIH_annualRate'].shift(6)
CPIH_annualRate_withLags = CPIH_annualRate_withLags['2018-01-01': '2022-12-31']
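The six `shift(...)` lines above can also be written as a loop, which makes it easier to vary the number of lags later. A sketch on a toy series with made-up values, following the same column-naming pattern:

```python
import pandas as pd

df = pd.DataFrame({'CPIH_annualRate': [2.71, 2.45, 2.29, 2.20]},
                  index=pd.date_range('2018-01-01', periods=4, freq='MS'))

# Add lag-1 .. lag-3 columns; shift(k) moves values k months forward,
# so row t sees the rate from month t - k.
for k in range(1, 4):
    df[f'CPIH_annualRate_lag{k}'] = df['CPIH_annualRate'].shift(k)

print(df.shape)  # (4, 4): the original column plus three lags
```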
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1']]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.03789570645833362, 'MSE_test': 17.02137499000003}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2']]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.03654371562500026, 'MSE_test': 16.767443689166694}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3']]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.025803197500000236, 'MSE_test': 16.381701388333365}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4',
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.030885422708333558, 'MSE_test': 17.159176142500026}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.0288340322916668, 'MSE_test': 15.860645565833371}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
'CPIH_annualRate_lag6'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.028212110208333505, 'MSE_test': 16.182123775000022}
highly_correlated_signals
['countOfAllTweets', 'sumOfCompoundSentimentForAllTweets', 'sumOf+veSentimentForAllTweets', 'sumOf-veSentimentForAllTweets', 'countOfInflationTweets', 'sumOf+veSentimentForInflationTweets', 'sumOf-veSentimentForInflationTweets', 'countOfProfessionalsTweets', 'sumOf+veSentimentForProfessionalsTweets', 'sumOf-veSentimentForProfessionalsTweets', 'countOfProfessionalsInflationTweets', 'sumOfallTweetsTop15Unigrams', 'sumOfinflationTweetsTop15Unigrams', 'sumOfinflationTweetsTop15Bigrams', 'sumOfinflationProfessionalTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams']
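A list like `highly_correlated_signals` is typically produced by screening each indicator's correlation with the target. A hedged sketch of that selection step on synthetic data — the column names, threshold (0.5) and series values here are illustrative, not the ones used above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
target = pd.Series(np.linspace(2, 10, 60))  # stand-in for the CPIH rate

indicators = pd.DataFrame({
    # Tracks the target with some noise -> high correlation.
    'countOfInflationTweets': target * 3 + rng.normal(0, 1, 60),
    # Pure noise -> near-zero correlation.
    'noise_signal': rng.normal(0, 1, 60),
})

# Keep indicators whose absolute correlation with the target exceeds 0.5.
corr = indicators.corrwith(target).abs()
selected = corr[corr > 0.5].index.tolist()
print(selected)
```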
all_data = pd.concat([indicators_df[highly_correlated_signals],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.0284012185416669, 'MSE_test': 16.110592104166695}
all_data = pd.concat([indicators_df['countOfInflationTweets'],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.022977760416666815, 'MSE_test': 14.307111895833366}
all_data = pd.concat([indicators_df[['countOfAllTweets', 'sumOfallTweetsTop15Unigrams']],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.02072736125000017, 'MSE_test': 14.565288057500036}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[8]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5'
]]], axis=1)
train_data = all_data[:'2021-12-01']
test_data = all_data['2022-01-01':]
X_train, X_test, y_train, y_test = [
train_data.drop(columns='CPIH_annualRate').values,
test_data.drop(columns='CPIH_annualRate').values,
train_data['CPIH_annualRate'].values,
test_data['CPIH_annualRate'].values
]
learner = RandomForestRegressor()
learner.fit(X_train, y_train)
results = {}
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train)
results['MSE_train'] = mean_squared_error(y_train, predictions_train)
results['MSE_test'] = mean_squared_error(y_test, predictions_test)
display(results)
predictions_train_test = list(predictions_train)
predictions_train_test.extend(list(predictions_test))
plt.figure(figsize=(15, 7))
all_data['CPIH_annualRate'].plot(label='Ground Truth')
pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, figsize=(15, 7), color='green')
plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
plt.xlabel('Date')
plt.ylabel('CPI')
plt.legend()
plt.show();
{'MSE_train': 0.026110546875000196, 'MSE_test': 15.475714840000037}
from sklearn.linear_model import LinearRegression
def performRegression(all_data, learner):
    train_data = all_data[:'2021-12-01']
    test_data = all_data['2022-01-01':]
    X_train, X_test, y_train, y_test = [
        train_data.drop(columns='CPIH_annualRate').values,
        test_data.drop(columns='CPIH_annualRate').values,
        train_data['CPIH_annualRate'].values,
        test_data['CPIH_annualRate'].values
    ]
    learner.fit(X_train, y_train)
    results = {}
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train)
    results['MSE_train'] = mean_squared_error(y_train, predictions_train)
    results['MSE_test'] = mean_squared_error(y_test, predictions_test)
    display(results)
    predictions_train_test = list(predictions_train)
    predictions_train_test.extend(list(predictions_test))
    plt.figure(figsize=(15, 7))
    all_data['CPIH_annualRate'].plot(label='Ground Truth')
    pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, color='green')
    plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
    plt.xlabel('Date')
    plt.ylabel('CPI')
    plt.legend()
    plt.show()
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate']]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.027605024519438682, 'MSE_test': 3687.848860178068}
all_data = pd.concat([indicators_df['countOfAllTweets'], CPIH_annualRate_withLags[['CPIH_annualRate']]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.688409950335676, 'MSE_test': 10.050961049950972}
all_data = pd.concat([indicators_df['countOfInflationTweets'], CPIH_annualRate_withLags[['CPIH_annualRate']]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.5470225804068439, 'MSE_test': 3.6402945833077838}
all_data = pd.concat([indicators_df['sumOf-veSentimentForInflationTweets'], CPIH_annualRate_withLags[['CPIH_annualRate']]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.5444836524500398, 'MSE_test': 5.5477112528044925}
highlyCorrelatedFeatures = ['sumOf-veSentimentForInflationTweets', 'meanOfCompoundSentimentForInflationTweets', 'meanOfallTweetsTop15Trigrams', 'sumOf-veSentimentForAllTweets',
'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams',
'countOfProfessionalsInflationTweets',
'sumOfinflationProfessionalTweetsTop15Unigrams',
'sumOfinflationTweetsTop15Unigrams', 'countOfProfessionalsTweets',
'countOfInflationTweets', 'sumOfinflationTweetsTop15Bigrams'
]
all_data = pd.concat([indicators_df[highlyCorrelatedFeatures], CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.2964095878153326, 'MSE_test': 4.786471086957849}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1']]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0024916992458001233, 'MSE_test': 238.09221569084127}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0012027376492995524, 'MSE_test': 1382.6481367656932}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0011922321806102463, 'MSE_test': 1187.1204517903113}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 4.1319161959576343e-23, 'MSE_test': 681.8996211514831}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 1.9482587497718435e-22, 'MSE_test': 1204.3563524464432}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
'CPIH_annualRate_lag6'
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 3.7030653798607403e-22, 'MSE_test': 1168.3282593101815}
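The near-zero train MSEs above (on the order of 1e-22) paired with very large test MSEs are the signature of exact interpolation: once the number of regressors approaches or exceeds the number of training months, OLS can fit the training sample perfectly while generalizing poorly. A small synthetic sketch of the effect (hypothetical dimensions, not this notebook's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_features = 20, 40          # more features than training samples
X = rng.normal(size=(n_train + 10, n_features))
y = rng.normal(size=n_train + 10)     # pure-noise target: nothing real to learn

model = LinearRegression().fit(X[:n_train], y[:n_train])
mse_train = mean_squared_error(y[:n_train], model.predict(X[:n_train]))
mse_test = mean_squared_error(y[n_train:], model.predict(X[n_train:]))

# Train MSE is numerically zero (exact interpolation); test MSE is not.
print(mse_train, mse_test)
```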
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[0]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
'CPIH_annualRate_lag6'
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.08080230116832929, 'MSE_test': 3.5175950182644127}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[1]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag5',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.07619029113983511, 'MSE_test': 0.3539402838034082}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[2]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
'CPIH_annualRate_lag6'
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0774115074025266, 'MSE_test': 3.5749056287990104}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[3]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.08260476713380856, 'MSE_test': 1.2452071391838568}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[4]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0746372016614568, 'MSE_test': 0.5070752980012113}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[5]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.07456051236021065, 'MSE_test': 0.4982577201404487}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[6]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag4',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.07220948879555282, 'MSE_test': 0.49350066664041653}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[7]],
CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.0700812265000217, 'MSE_test': 0.6088834834550043}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[8]],
CPIH_annualRate_withLags[['CPIH_annualRate',
'CPIH_annualRate_lag2',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.11274633101348867, 'MSE_test': 2.0823606415264204}
from sklearn.linear_model import Lasso
def performRegression(all_data, learner):
    train_data = all_data[:'2021-12-01']
    test_data = all_data['2022-01-01':]
    X_train, X_test, y_train, y_test = [
        train_data.drop(columns='CPIH_annualRate').values,
        test_data.drop(columns='CPIH_annualRate').values,
        train_data['CPIH_annualRate'].values,
        test_data['CPIH_annualRate'].values
    ]
    learner.fit(X_train, y_train)
    results = {}
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train)
    results['MSE_train'] = mean_squared_error(y_train, predictions_train)
    results['MSE_test'] = mean_squared_error(y_test, predictions_test)
    display(results)
    predictions_train_test = list(predictions_train)
    predictions_train_test.extend(list(predictions_test))
    plt.figure(figsize=(15, 7))
    all_data['CPIH_annualRate'].plot(label='Ground Truth')
    pd.Series(predictions_train_test, index=all_data['CPIH_annualRate'].index).plot(label='Forecast', alpha=.7, color='green')
    plt.vlines(test_data.index[0], 0, 10, colors='r', linestyles='dashed')
    plt.xlabel('Date')
    plt.ylabel('CPI')
    plt.legend()
    plt.show()
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
highlyCorrelatedFeatures = ['sumOf-veSentimentForInflationTweets', 'meanOfCompoundSentimentForInflationTweets', 'meanOfallTweetsTop15Trigrams', 'sumOf-veSentimentForAllTweets',
'sumOfallTweetsTop15Unigrams', 'meanOfinflationTweetsTop15Bigrams',
'countOfProfessionalsInflationTweets',
'sumOfinflationProfessionalTweetsTop15Unigrams',
'sumOfinflationTweetsTop15Unigrams', 'countOfProfessionalsTweets',
'countOfInflationTweets', 'sumOfinflationTweetsTop15Bigrams'
]
all_data = pd.concat([indicators_df[highlyCorrelatedFeatures], CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.4324037680115788, 'MSE_test': 7.18858476426368}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1']]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
all_data = pd.concat([indicators_df, CPIH_annualRate_withLags[['CPIH_annualRate', 'CPIH_annualRate_lag1',
'CPIH_annualRate_lag2', 'CPIH_annualRate_lag3',
'CPIH_annualRate_lag4', 'CPIH_annualRate_lag5',
'CPIH_annualRate_lag6'
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.2746622211709813, 'MSE_test': 5.343155966735118}
The lags contribute nothing under Lasso: train and test MSE are identical no matter how many lag columns are added, which suggests the L1 penalty is shrinking the lag coefficients to exactly zero.
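One way to check is to inspect `learner.coef_` after fitting: Lasso (at the default alpha=1.0, on unscaled features) sets coefficients to exactly zero when a feature's contribution does not justify the penalty, and large-scale features (like raw tweet counts) can absorb the signal with cheap small coefficients while small-scale columns (like the CPIH lags) get zeroed out. A minimal synthetic sketch of that mechanism (illustrative scales, not this notebook's data):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 48
big = rng.normal(scale=1000.0, size=n)   # large-scale feature, e.g. a tweet count
small = rng.normal(scale=1.0, size=n)    # small-scale feature, e.g. a CPIH lag
y = 0.001 * big + 0.5 * small

X = np.column_stack([big, small])
model = Lasso(alpha=1.0).fit(X, y)
# The large-scale feature keeps a (tiny) nonzero coefficient;
# the small-scale feature is shrunk to exactly zero.
print(model.coef_)
```

Standardizing the features before fitting (or tuning alpha) would change which coefficients survive, so this is a property of the scaling, not of the lags' predictive value.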
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[0]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.5447343969046502, 'MSE_test': 11.492059210308762}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[1]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.6522507303193567, 'MSE_test': 17.0439893895054}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[2]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.4648607031152623, 'MSE_test': 4.525564810194617}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[3]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = Lasso()
performRegression(all_data, learner)
{'MSE_train': 0.45525913964609344, 'MSE_test': 6.309683593722735}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[4]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.32588913993884655, 'MSE_test': 4.169985209370576}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[5]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.31225492980052133, 'MSE_test': 4.920056759527158}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[6]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.30193080258996213, 'MSE_test': 5.517963411740285}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[7]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.24922297905711685, 'MSE_test': 7.635482497626498}
all_data = pd.concat([indicators_df[combinationOflowestMSEPerCombination[8]],
CPIH_annualRate_withLags[['CPIH_annualRate',
]]], axis=1)
learner = LinearRegression()
performRegression(all_data, learner)
{'MSE_train': 0.2489410833066664, 'MSE_test': 9.470294458615035}